NONPARAMETRIC REGRESSION FOR LOCALLY STATIONARY TIME SERIES

By Michael Vogt 

University of Cambridge 

In this paper, we study nonparametric models allowing for locally 
stationary regressors and a regression function that changes smoothly 
over time. These models are a natural extension of time series models 
with time-varying coefficients. We introduce a kernel-based method to
estimate the time-varying regression function and provide asymptotic
theory for our estimates. Moreover, we show that the main conditions 
of the theory are satisfied for a large class of nonlinear autoregres- 
sive processes with a time-varying regression function. Finally, we 
examine structured models where the regression function splits up 
into time-varying additive components. As will be seen, estimation 
in these models does not suffer from the curse of dimensionality. 

1. Introduction. Classical time series analysis is based on the assump- 
tion of stationarity. However, many time series exhibit a nonstationary be- 
havior. Examples come from fields as diverse as finance, sound analysis and 
neuroscience. 

One way to model nonstationary behavior is provided by the theory of 
locally stationary processes introduced by Dahlhaus; cf. [5, 6] and [7]. Intu- 
itively speaking, a process is locally stationary if over short periods of time 
(i.e., locally in time) it behaves in an approximately stationary way. So far, 
locally stationary models have been mainly considered within a parametric 
context. Usually, parametric models are analyzed in which the coefficients 
are allowed to change smoothly over time. 

A considerable number of papers deal with time series models with
time-varying coefficients. Dahlhaus et al. [8], for example, study
wavelet estimation in autoregressive models with time-dependent parameters.
Dahlhaus and Subba Rao [9] analyze a class of ARCH models with time-varying
coefficients. They propose a kernel-based quasi-maximum likelihood
method to estimate the parameter functions; a kernel-based normalized- 
least-squares method is suggested by Fryzlewicz et al. [10]. Hafner and Lin- 
ton [12] provide estimation theory for a multivariate GARCH model with 
a time-varying unconditional variance. Finally, a diffusion process with a 
time-dependent drift and diffusion function is investigated in Koo and Lin- 
ton [14]. 

In this paper, we introduce a nonparametric framework which can be 
regarded as a natural extension of time series models with time-varying 
coefficients. In its most general form, the model is given by 

(1) $Y_{t,T} = m\bigl(\frac{t}{T}, X_{t,T}\bigr) + \varepsilon_{t,T}$ for $t = 1, \ldots, T$

with $E[\varepsilon_{t,T} \mid X_{t,T}] = 0$, where $Y_{t,T}$ and $X_{t,T}$ are random variables of dimension
1 and $d$, respectively. The model variables are assumed to be locally stationary
and the regression function as a whole is allowed to change smoothly
over time. As usual in the literature on locally stationary processes, the function
$m$ does not depend on real time $t$ but rather on rescaled time $\frac{t}{T}$. This
goes along with the model variables forming a triangular array instead of a 
sequence. Throughout the Introduction, we stick to an intuitive concept of 
local stationarity. A technically rigorous definition is given in Section 2. 

There is a wide range of interesting nonlinear time series models that fit 
into the general framework (1). An important example is the nonparametric 
autoregressive model 

(2) $Y_{t,T} = m\bigl(\frac{t}{T}, Y_{t-1,T}, \ldots, Y_{t-d,T}\bigr) + \varepsilon_{t,T}$ for $t = 1, \ldots, T$

with $E[\varepsilon_{t,T} \mid Y_{t-1,T}, \ldots, Y_{t-d,T}] = 0$, which is analyzed in Section 3. As will be
seen there, the process defined in (2) is locally stationary and strongly mixing
under suitable conditions on the function $m$ and the error terms $\varepsilon_{t,T}$. Note
that independently of the present work, Kristensen [16] has developed results 
on local stationarity of the process given in (2) under a set of assumptions 
similar to ours. 

In Section 4, we develop estimation theory for the nonparametric regres- 
sion function in the general framework (1). As described there, the regres- 
sion function is estimated by nonparametric kernel methods. We provide 
a complete asymptotic theory for our estimates. In particular, we derive 
uniform convergence rates and an asymptotic normality result. To do so, 
we split up the estimates into a variance part and a bias part. In order 
to control the variance part, we generalize results on uniform convergence 
rates for kernel estimates as provided, for example, in Bosq [3], Masry [18] 
and Hansen [13]. The locally stationary behavior of the model variables also
changes the asymptotic analysis of the bias part. In particular, it produces
an additional bias term which can be regarded as measuring the deviation 
from stationarity. 

Even though model (1) is theoretically interesting, it has an important 
drawback. Estimating the time-varying regression function in (1) suffers 
from an even more severe curse of dimensionality problem than in the stan- 
dard, strictly stationary setting with a time-invariant regression function. 
The reason is that in model (1), we fit a fully nonparametric function $m(u, \cdot)$
locally around each rescaled time point u. Compared to the standard case, 
this means that we additionally smooth in time direction and thus increase 
the dimensionality of the estimation problem by one. This makes the pro- 
cedure even more data consuming than in the standard setting and thus 
infeasible in many applications. 

To counteract this severe curse of dimensionality, we impose
some structural constraints on the regression function in (1). In particular, 
we consider additive models of the form 



with $X_{t,T} = (X_{t,T}^1, \ldots, X_{t,T}^d)$ and $E[\varepsilon_{t,T} \mid X_{t,T}] = 0$. In Section 5, we will show
that the component functions of this model can be estimated with two-
dimensional nonparametric convergence rates, no matter how large the
dimension $d$ is. To do so, we extend the smooth backfitting approach of
Mammen et al. [17] to our setting.

2. Local stationarity. Heuristically speaking, a process $\{X_{t,T} : t = 1, \ldots, T\}_{T=1}^{\infty}$
is locally stationary if it behaves approximately stationary locally
in time. This intuitive concept can be turned into a rigorous definition in
different ways. One way is to require that locally around each rescaled time
point $u$, the process $\{X_{t,T}\}$ can be approximated by a stationary process
$\{X_t(u) : t \in \mathbb{Z}\}$ in a stochastic sense; cf., for example, Dahlhaus and Subba
Rao [9]. This idea also underlies the following definition.

Definition 2.1. The process $\{X_{t,T}\}$ is locally stationary if for each
rescaled time point $u \in [0, 1]$ there exists an associated process $\{X_t(u)\}$ with
the following two properties:

(i) $\{X_t(u)\}$ is strictly stationary with density $f_{X_t(u)}$;

(ii) it holds that

(3) $\|X_{t,T} - X_t(u)\| \le \bigl(\bigl|\frac{t}{T} - u\bigr| + \frac{1}{T}\bigr) U_{t,T}(u)$ a.s.,

where $\{U_{t,T}(u)\}$ is a process of positive variables satisfying $E[(U_{t,T}(u))^{\rho}] < C$
for some $\rho > 0$ and $C < \infty$ independent of $u$, $t$ and $T$. $\|\cdot\|$ denotes an
arbitrary norm on $\mathbb{R}^d$.

Since the $\rho$th moments of the variables $U_{t,T}(u)$ are uniformly bounded,
it holds that $U_{t,T}(u) = O_p(1)$. As a consequence of the above definition, we
thus have

$\|X_{t,T} - X_t(u)\| = O_p\bigl(\bigl|\frac{t}{T} - u\bigr| + \frac{1}{T}\bigr)$.

The constant $\rho$ can be regarded as a measure of how well $X_{t,T}$ is approximated
by $X_t(u)$: the larger $\rho$ can be chosen, the less mass is contained in
the tails of the distribution of $U_{t,T}(u)$. Thus, if $\rho$ is large, then the bound
$(|\frac{t}{T} - u| + \frac{1}{T}) U_{t,T}(u)$ will take rather moderate values most of the time.
In this sense, the bound, and thus the approximation of $X_{t,T}$ by $X_t(u)$,
improves as $\rho$ grows.

3. Locally stationary nonlinear AR models. In this section, we exam- 
ine a large class of nonlinear autoregressive processes with a time-varying 
regression function that fit into the general framework (1). We show that 
these processes are locally stationary and strongly mixing under suitable 
conditions on the model components. To shorten notation, we repeatedly 
make use of the following abbreviation: for any array of variables $\{Z_{t,T}\}$, we
let $Z_{t-k,T}^{t} := (Z_{t-k,T}, \ldots, Z_{t,T})$ for $k > 0$.

3.1. The time-varying nonlinear AR (tvNAR) process. We call an array
$\{Y_{t,T} : t \in \mathbb{Z}\}_{T=1}^{\infty}$ a time-varying nonlinear autoregressive (tvNAR) process if
$Y_{t,T}$ evolves according to the equation

(4) $Y_{t,T} = m\bigl(\frac{t}{T}, Y_{t-d,T}^{t-1}\bigr) + \sigma\bigl(\frac{t}{T}, Y_{t-d,T}^{t-1}\bigr) \varepsilon_t$.

A tvNAR process is thus an autoregressive process of form (2) with errors
$\varepsilon_{t,T} = \sigma(\frac{t}{T}, Y_{t-d,T}^{t-1}) \varepsilon_t$. In the above definition, $m(u, y)$ and $\sigma(u, y)$ are
smooth functions of rescaled time $u$ and $y \in \mathbb{R}^d$. We stipulate that for
$u \le 0$, $m(u, y) = m(0, y)$ and $\sigma(u, y) = \sigma(0, y)$. Analogously, we set $m(u, y) =
m(1, y)$ and $\sigma(u, y) = \sigma(1, y)$ for $u \ge 1$. Furthermore, the variables $\varepsilon_t$ are
assumed to be i.i.d. with mean zero. For each $u \in \mathbb{R}$, we additionally define
the associated process $\{Y_t(u) : t \in \mathbb{Z}\}$ by

(5) $Y_t(u) = m\bigl(u, Y_{t-d}^{t-1}(u)\bigr) + \sigma\bigl(u, Y_{t-d}^{t-1}(u)\bigr) \varepsilon_t$,

where the rescaled time argument of the functions $m$ and $\sigma$ is fixed at $u$.
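The relation between (4) and (5) can be illustrated by a small simulation. The following is a minimal sketch for $d = 1$, not part of the original analysis: the functions `m` and `sigma`, the Gaussian errors, the burn-in length and all helper names are assumptions chosen only to be in the spirit of the boundedness and smoothness conditions used later, and the exact bound on the derivative of $m$ required by the theory is not verified. Both recursions are driven by the same errors, so that the gap between the two paths reflects only the difference between $\frac{t}{T}$ and $u$.

```python
import math
import random


def clip01(u):
    # m(u, .) = m(0, .) for u <= 0 and m(u, .) = m(1, .) for u >= 1
    return min(max(u, 0.0), 1.0)


def m(u, y):
    # Hypothetical regression function: bounded, Lipschitz in u,
    # derivative in y bounded by 0.8 and vanishing for large |y|.
    return (0.2 + 0.6 * clip01(u)) * math.tanh(y)


def sigma(u, y):
    # Hypothetical volatility function: bounded away from zero and infinity.
    return 1.0 + 0.5 * clip01(u)


def simulate(T, u, burn_in=200, seed=0):
    """Simulate Y_{t,T} from (4) and Y_t(u) from (5) with shared errors."""
    rng = random.Random(seed)
    y_tv = y_st = 0.0
    tv_path, st_path = [], []
    for t in range(1 - burn_in, T + 1):
        eps = rng.gauss(0.0, 1.0)
        y_tv = m(t / T, y_tv) + sigma(t / T, y_tv) * eps  # recursion (4)
        y_st = m(u, y_st) + sigma(u, y_st) * eps          # recursion (5)
        if t >= 1:
            tv_path.append(y_tv)
            st_path.append(y_st)
    return tv_path, st_path


def mean_abs_gap(tv_path, st_path, u, lo, hi):
    """Average |Y_{t,T} - Y_t(u)| over all t with lo <= |t/T - u| <= hi."""
    T = len(tv_path)
    gaps = [abs(a - b)
            for t, (a, b) in enumerate(zip(tv_path, st_path), start=1)
            if lo <= abs(t / T - u) <= hi]
    return sum(gaps) / len(gaps)
```

In this toy setting, `mean_abs_gap` evaluated for time points close to $u$ should be much smaller than for time points far away from $u$, which is the qualitative behavior that local stationarity asks for.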

As stipulated above, the functions $m$ and $\sigma$ in (4) do not change over time
for $t \le 0$. Put differently, $Y_{t,T} = m(0, Y_{t-d,T}^{t-1}) + \sigma(0, Y_{t-d,T}^{t-1}) \varepsilon_t$ for all $t \le 0$. We
can thus assume that $Y_{t,T} = Y_t(0)$ for $t \le 0$. Consequently, if there exists a
process $\{Y_t(0)\}$ that satisfies the system of equations (5) for $u = 0$, then
this immediately implies the existence of a tvNAR process $\{Y_{t,T}\}$ satisfying
(4). As will turn out, under appropriate conditions there exists a strictly
stationary solution $\{Y_t(u)\}$ to (5) for each $u \in \mathbb{R}$, in particular for $u = 0$.
We can thus take for granted that the tvNAR process $\{Y_{t,T}\}$ defined by (4)
exists.

Before we turn to the analysis of the tvNAR process, we compare it to the
framework of Zhou and Wu [24] and Zhou [23]. Their model is given by the
equation $Z_{t,T} = G(\frac{t}{T}, \psi_t)$, where $\psi_t = (\ldots, \varepsilon_{t-1}, \varepsilon_t)$ with i.i.d. variables $\varepsilon_t$ and
$G$ is a measurable function. In their theory, the variables $Z_t(u) = G(u, \psi_t)$
play the role of a stationary approximation at $u \in [0, 1]$. Under suitable
assumptions, we can iterate equation (5) to obtain that $Y_t(u) = F(u, \psi_t)$ for
some measurable function $F$. Note, however, that $Y_{t,T} \ne F(\frac{t}{T}, \psi_t)$ in general.
This is due to the fact that when iterating (5), we use the same functions
$m(u, \cdot)$ and $\sigma(u, \cdot)$ in each step. In contrast to this, different functions show
up in each step when iterating the tvNAR variables $Y_{t,T}$. Thus, the relation
between the tvNAR process $\{Y_{t,T}\}$ and the approximations $\{Y_t(u)\}$ is in
general different from that between the processes $\{Z_{t,T}\}$ and $\{Z_t(u)\}$ in the
setting of Zhou and Wu.

3.2. Assumptions. We now list some conditions which are sufficient to 
ensure that the tvNAR process is locally stationary and strongly mixing. To 
start with, the function m is supposed to satisfy the following conditions: 

(M1) $m$ is absolutely bounded by some constant $C_m < \infty$.

(M2) $m$ is Lipschitz continuous with respect to rescaled time $u$, that is,
there exists a constant $L < \infty$ such that $|m(u, y) - m(u', y)| \le L|u - u'|$ for
all $y \in \mathbb{R}^d$.

(M3) $m$ is continuously differentiable with respect to $y$. The partial derivatives
$\partial_j m(u, y) := \frac{\partial}{\partial y_j} m(u, y)$ have the property that for some $K_1 < \infty$,

$\sup_{u \in \mathbb{R}, \|y\|_\infty > K_1} |\partial_j m(u, y)| \le \delta < 1$.

An exact formula for the bound $\delta$ is given in (31) in Appendix A.

The function $\sigma$ is required to fulfill analogous assumptions.

(Σ1) $\sigma$ is bounded by some constant $C_\sigma < \infty$ from above and by some
constant $c_\sigma > 0$ from below, that is, $0 < c_\sigma \le \sigma(u, y) \le C_\sigma < \infty$ for all $u$
and $y$.

(Σ2) $\sigma$ is Lipschitz continuous with respect to rescaled time $u$.

(Σ3) $\sigma$ is continuously differentiable with respect to $y$. The partial derivatives
$\partial_j \sigma(u, y) := \frac{\partial}{\partial y_j} \sigma(u, y)$ have the property that for some $K_1 < \infty$,

$|\partial_j \sigma(u, y)| \le \delta < 1$ for all $u \in \mathbb{R}$ and $\|y\|_\infty > K_1$.




Finally, the error terms are required to have the following properties.

(E1) The variables $\varepsilon_t$ are i.i.d. with $E[\varepsilon_t] = 0$ and $E|\varepsilon_t|^{1+\eta} < \infty$ for some
$\eta > 0$. Moreover, they have an everywhere positive and continuous density $f_\varepsilon$.

(E2) The density $f_\varepsilon$ is bounded and Lipschitz, that is, there exists a
constant $L < \infty$ such that $|f_\varepsilon(z) - f_\varepsilon(z')| \le L|z - z'|$ for all $z, z' \in \mathbb{R}$.

To show that the tvNAR process is strongly mixing, we additionally need
the following condition on the density of the error terms:

(E3) Let $d_0$, $d_1$ be any constants with $0 \le d_0 \le D_0 < \infty$ and $|d_1| \le D_1 < \infty$.
The density $f_\varepsilon$ fulfills the condition

$\int |f_\varepsilon([1 + d_0] z + d_1) - f_\varepsilon(z)| \, dz \le C_{D_0, D_1} (d_0 + |d_1|)$

with $C_{D_0, D_1} < \infty$ only depending on the bounds $D_0$ and $D_1$.

We briefly comment on the above conditions:

(i) Our set of assumptions can be regarded as a strengthening of the
assumptions needed to show geometric ergodicity of nonlinear AR processes
of the form $Y_t = m(Y_{t-d}^{t-1}) + \sigma(Y_{t-d}^{t-1}) \varepsilon_t$. The main assumption in this context
requires the functions $m$ and $\sigma$ not to grow too fast outside a large bounded
set. More precisely, it requires them to be dominated by linear functions
with sufficiently small slopes; cf. Tjøstheim [21], Bhattacharya and Lee [2],
An and Huang [1] or Chen and Chen [4], among others. (M3) and (Σ3) are
very close in spirit to this kind of assumption. They restrict the growth of
$m$ and $\sigma$ by requiring the derivatives of these functions to be small outside
a large bounded set.

(ii) If we replace (M3) and (Σ3) with the stronger assumption that the
partial derivatives $|\partial_j m(u, y)|$ and $|\partial_j \sigma(u, y)|$ are globally bounded by some
sufficiently small number $\delta < 1$, then some straightforward modifications
allow us to dispense with the boundedness assumptions (M1) and (Σ1) in
the local stationarity and mixing proofs.

(iii) Condition (M3) implies that the derivatives $\partial_j m(u, y)$ are absolutely
bounded. Hence, there exists a constant $\Delta < \infty$ such that $|\partial_j m(u, y)| \le \Delta$
for all $u \in \mathbb{R}$ and $y \in \mathbb{R}^d$. Similarly, (Σ3) implies that the derivatives $\partial_j \sigma(u, y)$
are absolutely bounded by some constant $\Delta < \infty$.

(iv) As already noted, (E3) is only needed to prove that the tvNAR
process is strongly mixing. It is, for example, fulfilled for the class of bounded
densities $f_\varepsilon$ whose first derivative $f'_\varepsilon$ is bounded, satisfies $\int |z f'_\varepsilon(z)| \, dz < \infty$
and declines monotonically to zero for values $|z| > C$ for some constant
$C > 0$; see also Section 3 in Fryzlewicz and Subba Rao [11] who work with
assumptions closely related to (E3).

3.3. Properties of the tvNAR process. We now show that the tvNAR
process is locally stationary and strongly mixing under the assumptions
listed above. In addition, we will see that the auxiliary processes $\{Y_t(u)\}$
have densities that vary smoothly over rescaled time $u$. As will turn out,
these three properties are central for the estimation theory developed in
Sections 4 and 5.

The first theorem summarizes some properties of the tvNAR process and
of the auxiliary processes $\{Y_t(u)\}$ that are needed to prove the main results.

Theorem 3.1. Let (M1)-(M3), (Σ1)-(Σ3) and (E1) be fulfilled. Then:

(i) for each $u \in \mathbb{R}$, the process $\{Y_t(u), t \in \mathbb{Z}\}$ has a strictly stationary
solution with $\varepsilon_t$ independent of $Y_{t-k}(u)$ for $k > 0$;

(ii) the variables $Y_{t-d}^{t-1}(u)$ have a density $f_{Y_{t-d}^{t-1}(u)}$ w.r.t. Lebesgue measure;

(iii) the variables $Y_{t-d,T}^{t-1}$ have densities $f_{Y_{t-d,T}^{t-1}}$ w.r.t. Lebesgue measure.

The next result states that $\{Y_{t,T}\}$ can be locally approximated by $\{Y_t(u)\}$.
Together with Theorem 3.1, it shows that the tvNAR process $\{Y_{t,T}\}$ is locally
stationary in the sense of Definition 2.1.

Theorem 3.2. Let (M1)-(M3), (Σ1)-(Σ3) and (E1) be fulfilled. Then

(6) $|Y_{t,T} - Y_t(u)| \le \bigl(\bigl|\frac{t}{T} - u\bigr| + \frac{1}{T}\bigr) U_{t,T}(u)$ a.s.,

where the variables $U_{t,T}(u)$ have the property that $E[(U_{t,T}(u))^{\rho}] < C$ for some
$\rho > 0$ and $C < \infty$ independent of $u$, $t$ and $T$.

To get an idea of the proof of Theorem 3.2, consider the model $Y_{t,T} =
m(\frac{t}{T}, Y_{t-1,T}) + \varepsilon_t$ for a moment. Our arguments are based on a backward
expansion of the difference $Y_{t,T} - Y_t(u)$. Exploiting the smoothness conditions
of (M2) and (M3) together with the boundedness of $m$, we obtain
that

$|Y_{t,T} - Y_t(u)| \le C \sum_{r=0}^{n-1} \prod_{k=1}^{r} |\partial m(u, \xi_{t-k})| \bigl(\bigl|\frac{t}{T} - u\bigr| + \frac{r}{T}\bigr) + C \prod_{k=1}^{n} |\partial m(u, \xi_{t-k})|$,

where $\partial m(u, y)$ is the derivative of $m(u, y)$ with respect to $y$ and $\xi_{t-k}$ is an
intermediate point between $Y_{t-k,T}$ and $Y_{t-k}(u)$. To prove (6), we have to
show that the product $\prod_{k=1}^{n} |\partial m(u, \xi_{t-k})|$ is contracting in some stochastic
sense as $n$ tends to infinity. The heuristic idea behind the proof is the following:
using conditions (M1) and (E1), we can show that at least a certain
fraction of the terms $\xi_{t-1}, \ldots, \xi_{t-n}$ take a value in the region $\{y : |y| > K_1\}$
as $n$ grows large. Since the derivative $|\partial m|$ is small in this region according
to (M3), this ensures that at least a certain fraction of the elements in the
product $\prod_{k=1}^{n} |\partial m(u, \xi_{t-k})|$ are small in value. This prevents the product
from exploding and makes it contract to zero as $n$ goes to infinity.

Next, we come to a result which shows that the densities of the approximating
variables $Y_{t-d}^{t-1}(u)$ change smoothly over time.

Theorem 3.3. Let $f(u, y) := f_{Y_{t-d}^{t-1}(u)}(y)$ be the density of $Y_{t-d}^{t-1}(u)$ at
$y \in \mathbb{R}^d$. If (M1)-(M3), (Σ1)-(Σ3) and (E1), (E2) are fulfilled, then

$|f(u, y) - f(v, y)| \le C_y |u - v|^p$

with some constant $0 < p < 1$ and $C_y < \infty$ continuously depending on $y$.

We finally characterize the mixing behavior of the tvNAR process. To do
so, we first give a quick reminder of the definitions of an $\alpha$- and $\beta$-mixing
array. Let $(\Omega, \mathcal{A}, P)$ be a probability space, and let $\mathcal{B}$ and $\mathcal{C}$ be subfields of $\mathcal{A}$.
Define

$\alpha(\mathcal{B}, \mathcal{C}) = \sup_{B \in \mathcal{B}, C \in \mathcal{C}} |P(B \cap C) - P(B) P(C)|$,

$\beta(\mathcal{B}, \mathcal{C}) = E \sup_{C \in \mathcal{C}} |P(C) - P(C \mid \mathcal{B})|$.

Moreover, for an array $\{Z_{t,T} : 1 \le t \le T\}$, define the coefficients

(7) $\alpha(k) = \sup_{t,T : 1 \le t \le T-k} \alpha\bigl(\sigma(Z_{s,T}, 1 \le s \le t), \sigma(Z_{s,T}, t + k \le s \le T)\bigr)$,

(8) $\beta(k) = \sup_{t,T : 1 \le t \le T-k} \beta\bigl(\sigma(Z_{s,T}, 1 \le s \le t), \sigma(Z_{s,T}, t + k \le s \le T)\bigr)$,

where $\sigma(Z)$ is the $\sigma$-field generated by $Z$. The array $\{Z_{t,T}\}$ is said to be
$\alpha$-mixing (or strongly mixing) if $\alpha(k) \to 0$ as $k \to \infty$. Similarly, it is called
$\beta$-mixing if $\beta(k) \to 0$. Note that $\beta$-mixing implies $\alpha$-mixing. The final result
of this section shows that the tvNAR process is $\beta$-mixing with coefficients
that converge exponentially fast to zero.

Theorem 3.4. If (M1)-(M3), (Σ1)-(Σ3) and (E1)-(E3) are fulfilled,
then the tvNAR process $\{Y_{t,T}\}$ is geometrically $\beta$-mixing, that is, there exist
positive constants $\gamma < 1$ and $C < \infty$ such that $\beta(k) \le C \gamma^k$.

The strategy of the proof is as follows: the (conditional) probabilities that 
show up in the definition of the /3-coefficient in (8) can be written in terms 
of the functions m, a and the error density f e . To do so, we derive recursive 
expressions of the model variables $Y_{t,T}$ and of certain conditional densities
of $Y_{t,T}$. Rewriting the $\beta$-coefficient with the help of these expressions allows
us to derive an appropriate bound for it. The overall strategy is thus similar
to that of Fryzlewicz and Subba Rao [11] who also derive bounds of mixing
coefficients in terms of conditional densities. The specific steps of the proof, 
however, are quite different. The details together with the proofs of the other 
theorems can be found in Appendix A. 

4. Kernel estimation. In this section, we consider kernel estimation in 
the general model (1), 

$Y_{t,T} = m\bigl(\frac{t}{T}, X_{t,T}\bigr) + \varepsilon_{t,T}$ for $t = 1, \ldots, T$

with $E[\varepsilon_{t,T} \mid X_{t,T}] = 0$. Note that $m(\frac{t}{T}, \cdot)$ is the conditional mean function in
model (1) at the time point $t$. The function $m$ is thus identified almost surely
on the grid of points $\frac{t}{T}$ for $t = 1, \ldots, T$. These points form a dense subset of
the unit interval as the sample size grows to infinity. As a consequence, $m$ is
identified almost surely at all rescaled time points $u \in [0, 1]$ if it is continuous
in time direction (which we will assume in what follows). 

4.1. Estimation procedure. We restrict attention to Nadaraya-Watson
(NW) estimation. It is straightforward to extend the theory to local linear
(or more generally local polynomial) estimation. The NW estimator of model
(1) is given by

(9) $\hat{m}(u, x) = \frac{\sum_{t=1}^{T} K_h(u - \frac{t}{T}) \prod_{j=1}^{d} K_h(x^j - X_{t,T}^j) Y_{t,T}}{\sum_{t=1}^{T} K_h(u - \frac{t}{T}) \prod_{j=1}^{d} K_h(x^j - X_{t,T}^j)}$.

Here and in what follows, we write $X_{t,T} = (X_{t,T}^1, \ldots, X_{t,T}^d)$ and $x = (x^1, \ldots, x^d)$
for any vector $x \in \mathbb{R}^d$, that is, we use subscripts to indicate the time point
of observation and superscripts to denote the components of the vector. $K$
denotes a one-dimensional kernel function and we use the notation $K_h(v) =
K(\frac{v}{h})$. For convenience, we work with a product kernel and assume that the
bandwidth $h$ is the same in each direction. Our results can, however, be
easily modified to allow for nonproduct kernels and different bandwidths.

The estimate defined in (9) differs from the NW estimator in the stan- 
dard strictly stationary setting in that there is an additional kernel in time 
direction. We thus do not only smooth in the direction of the covariates $X_{t,T}$
but also in the time direction. This takes into account that the regression 
function is varying over time. In what follows, we derive the asymptotic 
properties of our NW estimate. The proofs are given in Appendix B. 
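The estimator in (9) translates directly into code. The following is a minimal sketch in plain Python, assuming an Epanechnikov kernel (any kernel with compact support would do) and a common bandwidth `h` in all directions; the function names and the toy data generator `toy_sample` are hypothetical and serve only as an illustration.

```python
import random


def epanechnikov(z):
    """Epanechnikov kernel K(z) = 0.75 (1 - z^2) on [-1, 1]."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def nw_estimate(u, x, X, Y, h, kernel=epanechnikov):
    """NW estimate of m(u, x) as in (9).

    X is a list of T covariate tuples X_{t,T}, Y the list of T responses,
    and K_h(v) = K(v / h) with the same bandwidth in every direction.
    """
    T = len(Y)
    num = 0.0
    den = 0.0
    for t in range(1, T + 1):
        weight = kernel((u - t / T) / h)          # kernel in time direction
        for x_j, X_j in zip(x, X[t - 1]):
            if weight == 0.0:
                break
            weight *= kernel((x_j - X_j) / h)     # product kernel in covariates
        num += weight * Y[t - 1]
        den += weight
    return num / den if den > 0.0 else float("nan")


def toy_sample(T, seed=0):
    """Hypothetical data: m(u, x) = u + x^2 with d = 1 and i.i.d. covariates."""
    rng = random.Random(seed)
    X = [(rng.uniform(-1.0, 1.0),) for _ in range(T)]
    Y = [t / T + X[t - 1][0] ** 2 + 0.1 * rng.gauss(0.0, 1.0)
         for t in range(1, T + 1)]
    return X, Y
```

For instance, `nw_estimate(0.5, (0.0,), X, Y, h=0.15)` with `X, Y = toy_sample(2000)` should return a value close to $m(0.5, 0) = 0.5$ in this toy setting. The function returns NaN when no observation receives positive weight.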

4.2. Assumptions. The following three conditions are central to our re- 
sults: 

(C1) The process $\{X_{t,T}\}$ is locally stationary in the sense of Definition 2.1.
Thus, for each time point $u \in [0, 1]$, there exists a strictly stationary
process $\{X_t(u)\}$ having the property that $\|X_{t,T} - X_t(u)\| \le (|\frac{t}{T} - u| + \frac{1}{T}) U_{t,T}(u)$
a.s. with $E[(U_{t,T}(u))^{\rho}] \le C$ for some $\rho > 0$.

(C2) The densities $f(u, x) := f_{X_t(u)}(x)$ of the variables $X_t(u)$ are smooth
in $u$. In particular, $f(u, x)$ is differentiable w.r.t. $u$ for each $x \in \mathbb{R}^d$, and the
derivative $\partial_0 f(u, x) := \frac{\partial}{\partial u} f(u, x)$ is continuous.

(C3) The array $\{X_{t,T}, \varepsilon_{t,T}\}$ is $\alpha$-mixing.

As seen in Section 3, these three conditions are essentially fulfilled for the
tvNAR process: (C1) and (C3) follow immediately from Theorems 3.2 and 3.4.
Moreover, Theorem 3.3 shows that the tvNAR process satisfies a weakened
version of (C2) which requires the densities $f_{X_t(u)}$ to be continuous rather
than differentiable in time direction. Note that we could do with this weak- 
ened version of (C2), however at the cost of getting slower convergence rates 
for the bias part of the NW estimate. 

In addition to the above three assumptions, we impose the following reg- 
ularity conditions: 

(C4) $f(u, x)$ is partially differentiable w.r.t. $x$ for each $u \in [0, 1]$. The
derivatives $\partial_j f(u, x) := \frac{\partial}{\partial x^j} f(u, x)$ are continuous for $j = 1, \ldots, d$.

(C5) $m(u, x)$ is twice continuously partially differentiable with first derivatives
$\partial_j m(u, x)$ and second derivatives $\partial_{ij}^2 m(u, x)$ for $i, j = 0, \ldots, d$.

(C6) The kernel $K$ is symmetric about zero, bounded and has compact
support, that is, $K(v) = 0$ for all $|v| > C_1$ with some $C_1 < \infty$. Furthermore,
$K$ is Lipschitz, that is, $|K(v) - K(v')| \le L|v - v'|$ for some $L < \infty$ and all
$v, v' \in \mathbb{R}$.

Finally, note that throughout the paper the bandwidth $h$ is assumed to
converge to zero at least at polynomial rate, that is, there exists a small
$\xi > 0$ such that $h \le C T^{-\xi}$ for some constant $C > 0$.

4.3. Uniform convergence rates for kernel averages. As a first step in the 
analysis of the NW estimate (9), we examine kernel averages of the general 
form 

(10) $\hat{\psi}(u, x) = \frac{1}{T h^{d+1}} \sum_{t=1}^{T} K_h\bigl(u - \frac{t}{T}\bigr) \prod_{j=1}^{d} K_h(x^j - X_{t,T}^j) W_{t,T}$

with $\{W_{t,T}\}$ being an array of one-dimensional random variables. A wide
range of kernel-based estimators, including the NW estimator defined in
(9), can be written as functions of averages of the above form. The asymptotic
behavior of such averages is thus of wider interest. For this reason, we
investigate the properties of these averages for a general array of variables
$\{W_{t,T}\}$. Later on we will employ the results with $W_{t,T} = 1$ and $W_{t,T} = \varepsilon_{t,T}$.
We now derive the uniform convergence rate of $\hat{\psi}(u, x) - E\hat{\psi}(u, x)$. To do
so, we make the following assumptions on the components in (10):

(K1) It holds that $E|W_{t,T}|^s \le C$ for some $s > 2$ and $C < \infty$.

(K2) The array $\{X_{t,T}, W_{t,T}\}$ is $\alpha$-mixing. The mixing coefficients $\alpha$ have
the property that $\alpha(k) \le A k^{-\beta}$ for some $A < \infty$ and $\beta > \frac{2s - 2}{s - 2}$.

(K3) Let $f_{X_{t,T}}$ and $f_{X_{t,T}, X_{t+l,T}}$ be the densities of $X_{t,T}$ and
$(X_{t,T}, X_{t+l,T})$, respectively. For any compact set $S \subseteq \mathbb{R}^d$, there exists a
constant $C = C(S)$ such that $\sup_{t,T} \sup_{x \in S} f_{X_{t,T}}(x) \le C$ and
$\sup_{t,T} \sup_{x \in S} E[|W_{t,T}|^s \mid X_{t,T} = x] f_{X_{t,T}}(x) \le C$. Moreover, there exists
a natural number $l^* < \infty$ such that for all $l \ge l^*$,
$\sup_{t,T} \sup_{x, x' \in S} E[|W_{t,T}| |W_{t+l,T}| \mid X_{t,T} = x, X_{t+l,T} = x'] f_{X_{t,T}, X_{t+l,T}}(x, x') \le C$.

The next theorem generalizes uniform convergence results of Hansen [13] 
for the strictly stationary case to our setting. See Kristensen [15] for related 
results. 

Theorem 4.1. Assume that (K1)-(K3) are satisfied with

(11) $\beta > \frac{2 + s(1 + (d + 1))}{s - 2}$

and that the kernel $K$ fulfills (C6). In addition, let the bandwidth satisfy

(12) $\frac{\phi_T \log T}{T^{\theta} h^{d+1}} = o(1)$

with $\phi_T$ slowly diverging to infinity (e.g., $\phi_T = \log \log T$) and

(13) $\theta = \frac{\beta(1 - 2/s) - 2/s - 1 - (d + 1)}{\beta + 3 - (d + 1)}$.

Finally, let $S$ be a compact subset of $\mathbb{R}^d$. Then it holds that

(14) $\sup_{u \in [0,1], x \in S} |\hat{\psi}(u, x) - E\hat{\psi}(u, x)| = O_p\Bigl(\sqrt{\frac{\log T}{T h^{d+1}}}\Bigr)$.

The convergence rate in the above theorem is identical to the rate obtained
for a $(d+1)$-dimensional nonparametric estimation problem in the standard
strictly stationary setting. This reflects the fact that by additionally smoothing
in time direction, we essentially have a $(d+1)$-dimensional problem in our
case. Moreover, note that with (11) and (13), we can compute that $\theta \in (0, 1 - \frac{2}{s}]$.
In particular, $\theta = 1 - \frac{2}{s}$ if the mixing coefficients decay exponentially
fast to zero, that is, if $\beta = \infty$. Restriction (12) on the bandwidth is thus a
strengthening of the usual condition that $T h^{d+1} \to \infty$.
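The interplay between the mixing rate in (11) and the exponent in (13) can be checked numerically. The following is a small sketch, assuming our reading of (11) and (13) in which the lower bound on $\beta$ is exactly the value at which $\theta$ becomes zero; the function names are hypothetical.

```python
def beta_lower_bound(s, d):
    """Right-hand side of (11): beta must exceed this value."""
    return (2.0 + s * (1.0 + (d + 1.0))) / (s - 2.0)


def theta(beta, s, d):
    """Exponent theta from (13) for mixing rate beta, moment order s, dimension d."""
    numerator = beta * (1.0 - 2.0 / s) - 2.0 / s - 1.0 - (d + 1.0)
    denominator = beta + 3.0 - (d + 1.0)
    return numerator / denominator
```

For example, with $s = 4$ and $d = 1$ the bound in (11) equals 7, `theta(7.0, 4.0, 1.0)` is zero, and `theta` increases towards $1 - \frac{2}{s} = 0.5$ as `beta` grows, in line with the statement that $\theta \in (0, 1 - \frac{2}{s}]$.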

4.4. Uniform convergence rates for NW estimates. The next theorem 
characterizes the uniform convergence behavior of our NW estimate. 

Theorem 4.2. Assume that (C1)-(C6) hold and that (K1)-(K3) are
fulfilled both for $W_{t,T} = 1$ and $W_{t,T} = \varepsilon_{t,T}$. Let $\beta$ satisfy (11) and suppose that
$\inf_{u \in [0,1], x \in S} f(u, x) > 0$. Moreover, assume that the bandwidth $h$ satisfies

(15) $\frac{\phi_T \log T}{T^{\theta} h^{d+1}} = o(1)$ and $\frac{1}{T^{r} h^{d+r}} = o(1)$

with $\theta$ given in (13), $\phi_T = \log \log T$, $r = \min\{\rho, 1\}$ and $\rho$ introduced in (C1).
Defining $I_h = [C_1 h, 1 - C_1 h]$, it then holds that

$\sup_{u \in I_h, x \in S} |\hat{m}(u, x) - m(u, x)| = O_p\Bigl(\sqrt{\frac{\log T}{T h^{d+1}}} + \frac{1}{T^{r} h^{d}} + h^2\Bigr)$.

To derive the above result, we decompose the difference $\hat{m}(u, x) - m(u, x)$
into a stochastic part and a bias part. Using Theorem 4.1, the stochastic
part can be shown to be of the order $O_p(\sqrt{\log T / T h^{d+1}})$. The bias term
splits up into two parts, a standard component of the order $O(h^2)$ and
a nonstandard component of the order $O(T^{-r} h^{-d})$. The latter component
results from replacing the variables $X_{t,T}$ by $X_t(\frac{t}{T})$ in the bias term. It thus
captures how far these variables are from their stationary approximations
$X_t(\frac{t}{T})$. Put differently, it measures the deviation from stationarity. As will be
seen in Appendix B, handling this nonstationarity bias requires techniques
substantially different from those needed to treat the bias term in a strictly
stationary setting.

Note that the additional nonstationarity bias converges faster to zero for
larger $r = \min\{\rho, 1\}$. This makes perfect sense if we recall from Section 2
that $r$ measures how well $X_{t,T}$ is locally approximated by $X_t(\frac{t}{T})$: the larger
$r$, the smaller the deviation of $X_{t,T}$ from its stationary approximation and
thus the smaller the additional nonstationarity bias.

4.5. Asymptotic normality. We conclude the asymptotic analysis of our 
NW estimate with a result on asymptotic normality. 

Theorem 4.3. Assume that (C1)-(C6) hold and that (K1)-(K3) are
fulfilled both for $W_{t,T} = 1$ and $W_{t,T} = \varepsilon_{t,T}$. Let $\beta \ge 4$ and $T^{r} h^{d+2} \to \infty$
with $r = \min\{\rho, 1\}$. Moreover, suppose that $f(u, x) > 0$ and that $\sigma^2(\frac{t}{T}, x) :=
E[\varepsilon_{t,T}^2 \mid X_{t,T} = x]$ is continuous. Finally, let $r > \frac{d+2}{d+5}$ to ensure that the bandwidth
$h$ can be chosen to satisfy $T h^{d+5} \to c_h$ for a constant $c_h$. Then



The above theorem parallels the asymptotic normality result for the standard
strictly stationary setting. In particular, the bias and variance expressions
$B_{u,x}$ and $V_{u,x}$ are very similar to those from the standard case. By
requiring that $T^{r} h^{d+2} \to \infty$, we make sure that the additional nonstationarity
bias is asymptotically negligible.



(16)

5. Locally stationary additive models. We now put some structural constraints
on the regression function $m$ in model (1). In particular, we assume
that for all rescaled time points $u \in [0, 1]$ and all points $x$ in a compact subset
of $\mathbb{R}^d$, say $[0, 1]^d$, the regression function can be split up into additive components
according to $m(u, x) = m_0(u) + \sum_{j=1}^{d} m_j(u, x^j)$. This means that
for $x \in [0, 1]^d$, we have the additive regression model

(18) $E[Y_{t,T} \mid X_{t,T} = x] = m_0\bigl(\frac{t}{T}\bigr) + \sum_{j=1}^{d} m_j\bigl(\frac{t}{T}, x^j\bigr)$.

To identify the component functions of model (18) within the unit cube
$[0, 1]^d$, we impose the condition that $\int m_j(u, x^j) p_j(u, x^j) \, dx^j = 0$ for all $j =
1, \ldots, d$ and all rescaled time points $u \in [0, 1]$. Here, the functions $p_j(u, x^j) =
\int p(u, x) \, dx^{-j}$ are the marginals of the density

$p(u, x) = \frac{I(x \in [0, 1]^d) f(u, x)}{P(X_0(u) \in [0, 1]^d)}$,

where as before $f(u, \cdot)$ is the density of the strictly stationary process $\{X_t(u)\}$.
Note that this normalization of the component functions varies over time in
the sense that for each rescaled time point $u$, we integrate with respect to a
different density.
To estimate the functions $m_0, \ldots, m_d$, we adapt the smooth backfitting
technique of Mammen et al. [17] to our setting. To do so, we first introduce
the auxiliary estimates

$\hat{p}(u, x) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0, 1]^d) K_h\bigl(u, \frac{t}{T}\bigr) \prod_{j=1}^{d} K_h(x^j, X_{t,T}^j)$,

$\hat{m}(u, x) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0, 1]^d) K_h\bigl(u, \frac{t}{T}\bigr) \prod_{j=1}^{d} K_h(x^j, X_{t,T}^j) Y_{t,T} / \hat{p}(u, x)$.

$\hat{p}(u, x)$ is a kernel estimate of the density $p(u, x)$, and $\hat{m}(u, x)$ is a $(d+1)$-
dimensional NW smoother that estimates $m(u, x)$ for $x \in [0, 1]^d$. In the above
definitions,

$T_{[0,1]^d} = \sum_{t=1}^{T} K_h\bigl(u, \frac{t}{T}\bigr) I(X_{t,T} \in [0, 1]^d)$

is the number of observations in the unit cube $[0, 1]^d$, where only time points
close to $u$ are taken into account, and

$K_h(v, w) = I(v, w \in [0, 1]) \frac{K_h(v - w)}{\int_0^1 K_h(s - w) \, ds}$

is a modified kernel weight. This weight has the property that $\int_0^1 K_h(v, w) \, dv = 1$
for all $w \in [0, 1]$, which is needed to derive the asymptotic properties of the
backfitting estimates.
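The normalization property of the modified kernel weight is easy to verify numerically. The sketch below assumes an Epanechnikov kernel, for which the normalizing integral has a closed form, and checks the property by a simple midpoint rule; the kernel choice and all function names are assumptions made for illustration only.

```python
def epanechnikov(z):
    """Epanechnikov kernel K(z) = 0.75 (1 - z^2) on [-1, 1]."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def normalizer(w, h):
    """Closed form of the integral of K((s - w) / h) over s in [0, 1]."""
    a = max(-1.0, (0.0 - w) / h)
    b = min(1.0, (1.0 - w) / h)

    def antiderivative(z):
        return 0.75 * (z - z ** 3 / 3.0)

    return h * (antiderivative(b) - antiderivative(a))


def modified_kernel(v, w, h):
    """Modified weight K_h(v, w): zero outside [0, 1], renormalized near the boundary."""
    if not (0.0 <= v <= 1.0 and 0.0 <= w <= 1.0):
        return 0.0
    return epanechnikov((v - w) / h) / normalizer(w, h)


def integrate_over_unit_interval(f, n=4000):
    """Midpoint rule on [0, 1] with n subintervals."""
    step = 1.0 / n
    return sum(f((i + 0.5) * step) for i in range(n)) * step
```

Integrating `modified_kernel(v, w, h)` over `v` should give a value close to one for interior points such as `w = 0.5` as well as for boundary points such as `w = 0.0`, whereas the unmodified weight $K_h(v - w) / h$ loses half of its mass at `w = 0.0`.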

Given the smoothers $\hat{p}$ and $\hat{m}$, we define the smooth backfitting estimates
$\tilde{m}_0(u), \tilde{m}_1(u, \cdot), \ldots, \tilde{m}_d(u, \cdot)$ of the functions $m_0(u), m_1(u, \cdot), \ldots, m_d(u, \cdot)$ at
the time point $u \in [0, 1]$ as the minimizers of the criterion

(19) $\int \Bigl(\hat{m}(u, w) - g_0 - \sum_{j=1}^{d} g_j(w^j)\Bigr)^2 \hat{p}(u, w) \, dw$,

where the minimization runs over all additive functions $g(x) = g_0 + g_1(x^1) +
\cdots + g_d(x^d)$ whose components are normalized to satisfy $\int g_j(w^j) \hat{p}_j(u, w^j) \, dw^j = 0$
for $j = 1, \ldots, d$. Here, $\hat{p}_j(u, x^j) = \int \hat{p}(u, x) \, dx^{-j}$ is the marginal
of the kernel density $\hat{p}(u, \cdot)$ at the point $x^j$.

According to (19), the backfitting estimate $\tilde{m}(u, \cdot) = \tilde{m}_0(u) + \sum_{j=1}^{d} \tilde{m}_j(u, \cdot)$
is an $L^2$-projection of the full-dimensional NW estimate $\hat{m}(u, \cdot)$ onto the subspace
of additive functions, where the projection is done with respect to the
density estimate $\hat{p}(u, \cdot)$. Note that (19) is a $d$-dimensional projection problem.
In particular, rescaled time does not enter as an additional dimension.
The projection is rather done separately for each time point $u \in [0, 1]$. We
thus fit a smooth backfitting estimate to the data separately around each
point in time $u$.

By differentiation, we can show that the minimizer of (19) is characterized by the system of integral equations

$$\tilde m_j(u,x^j) = \hat m_j(u,x^j) - \sum_{k \ne j} \int \tilde m_k(u,x^k)\, \frac{\hat p_{j,k}(u,x^j,x^k)}{\hat p_j(u,x^j)}\,dx^k - \tilde m_0(u) \tag{20}$$

together with $\int \tilde m_j(u,w^j)\,\hat p_j(u,w^j)\,dw^j = 0$ for $j = 1,\ldots,d$. Here, $\hat p_j$ and $\hat p_{j,k}$ are kernel density estimates, and $\hat m_j$ is a NW smoother defined as

$$\hat p_j(u,x^j) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0,1]^d)\, K_h\Big(u,\frac{t}{T}\Big) K_h(x^j, X_{t,T}^j),$$

$$\hat p_{j,k}(u,x^j,x^k) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0,1]^d)\, K_h\Big(u,\frac{t}{T}\Big) K_h(x^j, X_{t,T}^j)\, K_h(x^k, X_{t,T}^k),$$

$$\hat m_j(u,x^j) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0,1]^d)\, K_h\Big(u,\frac{t}{T}\Big) K_h(x^j, X_{t,T}^j)\, Y_{t,T} \Big/ \hat p_j(u,x^j).$$

Moreover, the estimate $\tilde m_0(u)$ of the model constant at time point $u$ is given by $\tilde m_0(u) = T_{[0,1]^d}^{-1} \sum_{t=1}^{T} I(X_{t,T} \in [0,1]^d)\, K_h(u,\frac{t}{T})\, Y_{t,T}$.
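To make the iteration behind the integral equations (20) concrete, the following minimal sketch solves them on a grid at a fixed rescaled time $u$. It is an illustration under simplifying assumptions, not the implementation used in the paper: all covariates are assumed to lie in $[0,1]^d$ (so the indicator equals one), $u$ is an interior point (so a plain kernel in the time direction suffices), the kernel is Epanechnikov, integrals are replaced by Riemann sums, and the names `epan`, `grid_kernel` and `smooth_backfit` are hypothetical.

```python
import numpy as np

def epan(z):
    """Epanechnikov kernel on [-1, 1] (assumed choice of K)."""
    return np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z**2), 0.0)

def grid_kernel(grid, pts, h):
    """Weights K_h(grid[a], pts[t]), rescaled so that every column sums to
    one over the grid (Riemann-sum analogue of the modified kernel)."""
    dx = grid[1] - grid[0]
    k = epan((grid[:, None] - pts[None, :]) / h)
    return k / (k.sum(axis=0, keepdims=True) * dx)

def smooth_backfit(u, X, Y, h, n_grid=51, n_iter=30):
    """Iterate the integral equations at a fixed rescaled time u.
    Assumes all rows of X lie in [0, 1]^d and u is an interior point."""
    T, d = X.shape
    grid = np.linspace(0.0, 1.0, n_grid)
    dx = grid[1] - grid[0]
    wt = epan((u - np.arange(1, T + 1) / T) / h)
    wt = wt / wt.sum()                       # K_h(u, t/T) / T_{[0,1]^d}
    K = [grid_kernel(grid, X[:, j], h) for j in range(d)]
    p = [K[j] @ wt for j in range(d)]        # marginal densities p_j
    m_hat = [(K[j] @ (wt * Y)) / p[j] for j in range(d)]  # NW pilots
    p2 = {(j, k): (K[j] * wt) @ K[k].T       # bivariate densities p_{j,k}
          for j in range(d) for k in range(d) if j != k}
    m0 = float(np.sum(wt * Y))               # estimate of the constant
    m = [np.zeros(n_grid) for _ in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            upd = m_hat[j] - m0
            for k in range(d):
                if k != j:
                    upd = upd - (p2[(j, k)] @ m[k]) * dx / p[j]
            m[j] = upd - np.sum(upd * p[j]) * dx   # enforce int m_j p_j = 0
    return grid, m0, m
```

Each pass updates one component at a time using only one- and two-dimensional kernel objects, which is why the dimension $d$ of the full regression function does not enter the smoothing step.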

We next summarize the assumptions needed to derive the asymptotic 
properties of the smooth backfitting estimates. First of all, the conditions of 
Section 4 must be satisfied for the kernel estimates that show up in the sys- 
tem of integral equations (20). This is ensured by the following assumption. 

(Add1) Conditions (C1)-(C6) are fulfilled together with (K1)-(K3) for $W_{t,T} = 1$ and $W_{t,T} = \varepsilon_{t,T}$. The parameter $\beta$ satisfies the inequality $\beta$ > max{4,^} and $\inf_{u \in [0,1],\, x \in [0,1]^d} f(u,x) > 0$.

In addition to (Addl), we need some restrictions on the admissible band- 
width. For convenience, we stipulate somewhat stronger conditions than in 
Section 4 to get rid of the additional nonstationarity bias from the very 
beginning. 

(Add2) The bandwidth $h$ is such that (i) $Th^5 \to \infty$, (ii) ^i° g g T = $o(1)$ with $\phi_T = \log\log T$ and $\theta$ = min{^, /3(i-2/sj-2/s-3 | and (iii) $(T^r h)^{-1} = o(h^2)$ and 2" _r /( r '+ 1 ) = $o(h^2)$ with $r = \min\{\rho, 1\}$ and $\rho$ given in (C1).

Condition (ii) is already known from Section 4. As will be seen in Appendix C, (iii) ensures that the additional nonstationarity bias is of smaller order than $O(h^2)$ and can thus be asymptotically neglected. The expressions for $\beta$ and $\theta$ in (Add1) and (Add2) are calculated as follows: using the formulas (11) and (13) from Theorem 4.1, we get a pair of expressions for $\beta$ and $\theta$ for each of the kernel estimates occurring in (20). Combining these expressions yields the formulas in (Add1) and (Add2).

Under the above assumptions, we can establish the following results, the 
proofs of which are given in Appendix C. First, the backfitting estimates 
uniformly converge to the true component functions at the two-dimensional 
rates no matter how large the dimension d of the full regression function. 



Theorem 5.1. Let $I_h = [2C_1 h, 1 - 2C_1 h]$. Then under (Add1) and (Add2),

$$\sup_{u,\,x^j \in I_h} |\tilde m_j(u,x^j) - m_j(u,x^j)| = O_p\Big(\sqrt{\frac{\log T}{Th^2}} + h^2\Big). \tag{21}$$



Second, the estimates are asymptotically normal if rescaled appropriately. 

Theorem 5.2. Suppose that (Add1) and (Add2) hold. In addition, let $\theta$ > ^ and $r$ > ^ to ensure that the bandwidth $h$ can be chosen to satisfy $T_{[0,1]^d} h^6 \to c_h$ for a constant $c_h$. Then for any $u, x^1, \ldots, x^d \in (0,1)$,

$$\sqrt{T_{[0,1]^d} h^2} \begin{pmatrix} \tilde m_1(u,x^1) - m_1(u,x^1) \\ \vdots \\ \tilde m_d(u,x^d) - m_d(u,x^d) \end{pmatrix} \overset{d}{\longrightarrow} N(B_{u,x}, V_{u,x}). \tag{22}$$

Here, $V_{u,x}$ is a diagonal matrix whose diagonal entries are given by the expressions $v_j(u,x^j) = \kappa_0^2 \sigma^2(u,x^j)/p_j(u,x^j)$ with $\kappa_0 = \int K^2(\varphi)\,d\varphi$. Moreover, the bias term has the form $B_{u,x} = \sqrt{c_h}\,[\beta_1(u,x^1) - \gamma_1(u), \ldots, \beta_d(u,x^d) - \gamma_d(u)]^T$. The functions $\beta_j(u,\cdot)$ in this expression are defined as the minimizers of the problem

$$\int [\beta(u,x) - b_0 - b_1(x^1) - \cdots - b_d(x^d)]^2\, p(u,x)\,dx,$$

where the minimization runs over all additive functions $b(x) = b_0 + b_1(x^1) + \cdots + b_d(x^d)$ with $\int b_j(x^j)\,p_j(u,x^j)\,dx^j = 0$, and the function $\beta$ is given in Lemma C.4 of Appendix C. Moreover, the terms $\gamma_j$ can be characterized by the equation $\int \alpha_{T,j}(u,x^j)\,\hat p_j(u,x^j)\,dx^j = h^2 \gamma_j(u) + o_p(h^2)$, where the functions $\alpha_{T,j}$ are again defined in Lemma C.4.

6. Concluding remarks. In this paper, we have studied nonparametric models with a time-varying regression function and locally stationary covariates. We have developed a complete asymptotic theory for kernel estimates in these models. In addition, we have shown that the main assumptions of the theory are satisfied for a large class of nonlinear autoregressive processes with a time-varying regression function.

Our analysis can be extended in several directions. An important issue 
is bandwidth selection in our framework. As shown in Theorem 4.3, the 
asymptotic bias and variance expressions of our NW estimate are very simi- 
lar in structure to those from a standard stationary random design. We thus 
conjecture that the techniques to choose the bandwidth in such a design can 
be adapted to our setting. In particular, using the formulas for the asymp- 
totic bias and variance from Theorem 4.3, it should be possible to select the 
bandwidth via plug-in methods. 

Another issue concerns forecasting. The convergence results of Theorems 4.2 and 5.1 are only valid for rescaled time lying in a subset $[Ch, 1 - Ch]$ of the unit interval. For forecasting purposes, it would be important to provide convergence rates also in the boundary region $(1 - Ch, 1]$. This can be achieved by using boundary-corrected kernels. Another possibility is to work with one-sided kernels. In both cases, we have to ensure that the kernels have compact support and are Lipschitz to get the theory to work.

APPENDIX A 

In this Appendix, we prove the results on the tvNAR process from Section 3. To shorten notation, we frequently make use of the abbreviations $\underline{Y}_{t,T} = Y_{t,T}^{t-d+1}$, $\underline{Y}_t(u) = Y_t^{t-d+1}(u)$ and $\underline{\varepsilon}_t = \varepsilon_t^{t-d+1}$. Moreover, throughout the Appendices, the symbol $C$ denotes a universal real constant which may take a different value on each occurrence.

Preliminaries. Before we come to the proofs of the theorems, we state some useful facts needed for the arguments later on.

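Linearization of $m$ and $\sigma$. Consider the function $m$. The mean value theorem allows us to write

$$m(v, \underline{Y}_{t-1}(v)) - m(u, \underline{Y}_{t-1}(u)) = \Delta^m_{t,0} + \sum_{j=1}^{d} \Delta^m_{t,j}\,(Y_{t-j}(v) - Y_{t-j}(u)), \tag{23}$$

where we have used the shorthands $\Delta^m_{t,0} = m(v, \underline{Y}_{t-1}(v)) - m(u, \underline{Y}_{t-1}(v))$ and $\Delta^m_{t,j} = \Delta^m_j(u, \underline{Y}_{t-1}(u), \underline{Y}_{t-1}(v))$ for $j = 1,\ldots,d$ with the functions

$$\Delta^m_j(u,y,y') = \int_0^1 \partial_j m(u, y + s(y' - y))\,ds.$$

The terms $\Delta^m_{t,j}$ have the property that

$$|\Delta^m_{t,j}| \le \Delta_t := \Delta\, I(\|\underline{\varepsilon}_{t-1}\|_\infty \le K_2) + \delta\, I(\|\underline{\varepsilon}_{t-1}\|_\infty > K_2) \tag{24}$$

for $j = 1,\ldots,d$ with $K_2 = (K_1 + C_m)/c_\sigma$ and $\Delta \ge \sup_{u,y} |\partial_j m(u,y)|$. This is a straightforward consequence of the boundedness assumptions on $m$ and $\sigma$. See the supplement [22] for details.

Repeating the above considerations for the function $\sigma$, we obtain analogous terms $\Delta^\sigma_{t,j}$ that are again bounded by $\Delta_t$ for $j = 1,\ldots,d$.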

Recursive formulas for Y t> T- For the proof of Theorem 3.4, we rewrite 
Y t) T in a recursive fashion: letting yj~^ an d be values of Y^~^ 2 and 

respectively, we recursively define the functions m^ T by mf') r (y t t Zi) = 
m(^,ylzf) and for i> 1 by 

m t,T\ e t-lii)t-i-l) 

_ „,(*-!) m (0) Lt-i-^iJ") / t-i-d\ t-i-eH-l\ 

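Using analogous recursions for the function $\sigma$, we can additionally define functions $\sigma_{t,T}^{(i)}$ for $i \ge 0$. With this notation at hand, $Y_{t,T}$ can be represented as

$$Y_{t,T} = m_{t,T}^{(i)}(\varepsilon_{t-1}^{t-i}, Y_{t-i-1,T}^{t-i-d}) + \sigma_{t,T}^{(i)}(\varepsilon_{t-1}^{t-i}, Y_{t-i-1,T}^{t-i-d})\,\varepsilon_t.$$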
Moreover, for i > d we can write 

(i-d) ( t -i y t-i-d) , (i-d) ( t-i yt-i-ds \ 

The term &[ 2/*Z*Zi) can be reformulated in the same way. 






Formulas for conditional densities. Throughout the Appendix, the symbol $f_{V|W}$ is used to denote the density of $V$ conditional on $W$. If the residuals $\varepsilon_t$ have a density $f_\varepsilon$, then it can be shown that for $1 \le r \le d$,

(nr\ f I I t-r+1 -s s 1 , ( Vt ~ m t,T 

(25) f Yt TlY t-r+i jE -^ Y - s -d^ {ytWt-i , e t - r ,z) = —fe I 



Y 

t-r' 1 -s-l.T 



Here, yt, y^ +i , e t _ s r and z are values of 5^,t, ^t*_ir ' e t-r ari d ^ 



respectively. Moreover, 



-s— d 

•S-LT' 



t-r+l (t-r+s) 



T 



Vt-l 



' m t-rT 



(e 



t-r-H z ) ^ a t-r,T 



■m 



(t-d+s) 
l t—d,T 

and &t,T is defined analogously. 



-d-li^l ^ °t-d,T y^t-d 



(e t _ r _ 1 , z)et- r , ■ 



Proof of Theorem 3.1. Property (i) follows by standard arguments to be found, for example, in Chen and Chen [4]. Property (ii) immediately follows with the help of (25). Recalling that $Y_{t,T} = Y_t(0)$ for $t \le 0$, (iii) can again be shown by using (25).

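Proof of Theorem 3.2. We apply the triangle inequality to get

$$|Y_{t,T} - Y_t(u)| \le \Big|Y_{t,T} - Y_t\Big(\frac{t}{T}\Big)\Big| + \Big|Y_t\Big(\frac{t}{T}\Big) - Y_t(u)\Big|$$

and bound the terms $|Y_{t,T} - Y_t(\frac{t}{T})|$ and $|Y_t(\frac{t}{T}) - Y_t(u)|$ separately. In what follows, we restrict attention to the term $|Y_t(\frac{t}{T}) - Y_t(u)|$, the arguments for $|Y_{t,T} - Y_t(\frac{t}{T})|$ being analogous.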

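Notation. Throughout the proof, the symbol $\|z\|$ denotes the Euclidean norm for vectors $z \in \mathbb{R}^d$, and $\|A\|$ is the spectral norm for $d \times d$ matrices $A = (a_{ik})_{i,k=1,\ldots,d}$. In addition, $\|A\|_1 = \max_{k=1,\ldots,d} \sum_{j=1}^{d} |a_{jk}|$. Furthermore, for $z \in \mathbb{R}$, we define the family of matrices

$$B(z) = \begin{pmatrix} z & \cdots & z & z \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}.$$

Finally, as already noted at the beginning of the Appendix, we make use of the shorthands $\underline{Y}_{t,T} = Y_{t,T}^{t-d+1}$, $\underline{Y}_t(u) = Y_t^{t-d+1}(u)$ and $\underline{\varepsilon}_t = \varepsilon_t^{t-d+1}$.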

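Backward iteration. By the smoothness conditions on $m$ and $\sigma$,

$$Y_t\Big(\frac{t}{T}\Big) - Y_t(u) = (\Delta^m_{t,0} + \Delta^\sigma_{t,0}\varepsilon_t) + \sum_{j=1}^{d} (\Delta^m_{t,j} + \Delta^\sigma_{t,j}\varepsilon_t)\Big(Y_{t-j}\Big(\frac{t}{T}\Big) - Y_{t-j}(u)\Big)$$

with $\Delta^m_{t,0} = m(\frac{t}{T}, \underline{Y}_{t-1}(\frac{t}{T})) - m(u, \underline{Y}_{t-1}(\frac{t}{T}))$ and $\Delta^m_{t,j} = \Delta^m_j(u, \underline{Y}_{t-1}(u), \underline{Y}_{t-1}(\frac{t}{T}))$ for $j = 1,\ldots,d$ as introduced in (23). The terms $\Delta^\sigma_{t,j}$ for $j = 0,\ldots,d$ are defined analogously. In matrix notation, we obtain

$$\underline{Y}_t\Big(\frac{t}{T}\Big) - \underline{Y}_t(u) = A_t\Big(\underline{Y}_{t-1}\Big(\frac{t}{T}\Big) - \underline{Y}_{t-1}(u)\Big) + \xi_t \tag{26}$$

with $\xi_t = (\Delta^m_{t,0} + \Delta^\sigma_{t,0}\varepsilon_t, 0, \ldots, 0)^T$ and

$$A_t = \begin{pmatrix} \Delta^m_{t,1} + \Delta^\sigma_{t,1}\varepsilon_t & \cdots & \Delta^m_{t,d-1} + \Delta^\sigma_{t,d-1}\varepsilon_t & \Delta^m_{t,d} + \Delta^\sigma_{t,d}\varepsilon_t \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}.$$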

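Iterating (26) $n$ times yields

$$\underline{Y}_t\Big(\frac{t}{T}\Big) - \underline{Y}_t(u) = \xi_t + \sum_{r=0}^{n-1} \prod_{k=0}^{r} A_{t-k}\, \xi_{t-r-1} + \prod_{k=0}^{n} A_{t-k} \Big(\underline{Y}_{t-n-1}\Big(\frac{t}{T}\Big) - \underline{Y}_{t-n-1}(u)\Big).$$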

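Note that the rescaled time argument $\frac{t}{T}$ plays the same role as the argument $u$ and thus remains fixed when iterating backward. Next define matrices $B_t$ by

$$B_t = (1 + |\varepsilon_t|)\,B(\Delta_t) \tag{27}$$

with $\Delta_t = \Delta\, I(\|\underline{\varepsilon}_{t-1}\|_\infty \le K_2) + \delta\, I(\|\underline{\varepsilon}_{t-1}\|_\infty > K_2)$. As shown in the preliminaries section of the Appendix, $|\Delta^m_{t,j} + \Delta^\sigma_{t,j}\varepsilon_t| \le \Delta_t(1 + |\varepsilon_t|)$ for $j = 1,\ldots,d$. Therefore, the entries of the matrix $B_t$ are all weakly larger in absolute value than those of $A_t$. This implies that $\|\prod_{k=0}^{n} A_{t-k}\, z\| \le \|\prod_{k=0}^{n} B_{t-k}\, |z|\|$ with $|z| = (|z_1|, \ldots, |z_d|)$. Using this together with the boundedness of $m$ and $\sigma$ and the fact that $|\Delta^m_{t,0} + \Delta^\sigma_{t,0}\varepsilon_t| \le C|\frac{t}{T} - u|(1 + |\varepsilon_t|)$, we finally arrive at

$$\Big\|\underline{Y}_t\Big(\frac{t}{T}\Big) - \underline{Y}_t(u)\Big\| \le \Big|\frac{t}{T} - u\Big|\, V_{t,n} + R_{t,n}$$

with

$$V_{t,n} = C(1 + |\varepsilon_t|) + C \sum_{r=0}^{n-1} (1 + |\varepsilon_{t-r-1}|)\, \Big\|\prod_{k=0}^{r} B_{t-k}\Big\|,$$

$$R_{t,n} = C(1 + \|\underline{\varepsilon}_{t-n-1}\|)\, \Big\|\prod_{k=0}^{n} B_{t-k}\Big\|.$$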
Bounding $V_{t,n}$ and $R_{t,n}$. The convergence behavior of $V_{t,n}$ and $R_{t,n}$ for $n \to \infty$ mainly depends on the properties of the product $\|\prod_{k=0}^{n} B_{t-k}\|$. The behavior of the latter is described by the following lemma.

Lemma A.1. If $\delta$ is sufficiently small, in particular, if it satisfies (31), then there exists a constant $p > 0$ such that for some $\gamma < 1$,

$$E\Big\|\prod_{k=0}^{n} B_{t-k}\Big\|^p \le C\gamma^n. \tag{28}$$

The proof of Lemma A.1 is postponed until the arguments for Theorem 3.2 are completed. The following statement is a direct consequence of Lemma A.1.

(R) There exists a constant $p > 0$ such that $E[R_{t,n}^p] \le C\gamma^n$ for some $\gamma < 1$. In particular, $R_{t,n} \to 0$ almost surely as $n \to \infty$.

In addition, it holds that:

(V) $V_{t,n} \le V_t$, where the variables $V_t$ have the property that $E[V_t^p] \le C$ for a positive constant $p < 1$ and all $t$.

This can be seen as follows. First note that

$$V_{t,n} \le C(1 + |\varepsilon_t|) + \sum_{r=0}^{n-1} R_{t,r} \le V_t := C(1 + |\varepsilon_t|) + \sum_{r=0}^{\infty} R_{t,r}.$$

Using the monotone convergence theorem and Loève's inequality with $p < 1$, we obtain $E[V_t^p] \le C\,E(1 + |\varepsilon_t|)^p + \sum_{r=0}^{\infty} E[R_{t,r}^p]$. As the right-hand side of the previous inequality is finite by (R), we arrive at (V).

(R) and (V) imply that $|Y_t(\frac{t}{T}) - Y_t(u)| \le |\frac{t}{T} - u|\,V_t$ a.s. with variables $V_t$ whose $p$th moment is uniformly bounded by some finite constant $C$. An analogous result can be derived for $|Y_{t,T} - Y_t(\frac{t}{T})|$. This completes the proof.
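The bound just derived can be illustrated by simulation. The sketch below is only an illustration under assumptions of our own choosing, not a model from the paper: $d = 1$, $\sigma \equiv 1$, standard normal errors and the hypothetical regression function $m(u,y) = (0.2 + 0.6u)\tanh(y)$, which is bounded and contracting in $y$. The time-varying process and its stationary approximation at a fixed $u$ are driven by the same errors, so their gap can be compared directly with $|t/T - u|$.

```python
import numpy as np

def m(u, y):
    """Hypothetical time-varying regression function, bounded and
    contracting in y (illustrative choice, not taken from the paper)."""
    return (0.2 + 0.6 * u) * np.tanh(y)

def simulate(T, u, seed=0):
    """Simulate Y_{t,T} = m(t/T, Y_{t-1,T}) + eps_t and the stationary
    approximation Y_t(u) = m(u, Y_{t-1}(u)) + eps_t with common errors."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T + 1)
    y_tv = np.zeros(T + 1)   # index 0 holds the common starting value
    y_st = np.zeros(T + 1)
    for t in range(1, T + 1):
        y_tv[t] = m(t / T, y_tv[t - 1]) + eps[t]
        y_st[t] = m(u, y_st[t - 1]) + eps[t]
    return y_tv, y_st

y_tv, y_st = simulate(T=1000, u=0.5)
gap = np.abs(y_tv - y_st)    # small for t/T close to u, larger far away
```

In this toy model the gap at $t = uT$ is of order $1/T$, while it can be of order one for time points far from $u$, in line with the factor $|t/T - u| + 1/T$.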

Proof of Lemma A.1. We want to show that the $p$th moment of the product $\|\prod_{k=0}^{n} B_{t-k}\|$ converges exponentially fast to zero as $n \to \infty$. This is a highly nontrivial problem, and as far as we can see, it cannot be solved by simply adapting techniques from related papers on models with time-varying coefficients. The problem is that the techniques used therein are either tailored to products of deterministic matrices (see, e.g., Proposition 13 in Moulines et al. [19]) or they heavily draw on the independence of the random matrices involved (see, e.g., Proposition 2.1 in Subba Rao [20]).

We now describe our proving strategy in detail. To start with, we replace the spectral norm $\|\cdot\|$ in (28) by the norm $\|\cdot\|_1$ which is much easier to handle. As these two norms are equivalent, there exists a finite constant $C$ such that $\|\prod_{k=0}^{n} B_{t-k}\| \le C B_n$ with $B_n = \|\prod_{k=0}^{n} B_{t-k}\|_1$. Next, we split up the term $B_n$ into two parts,

$$B_n = I_n B_n + (1 - I_n) B_n =: B_{n,1} + B_{n,2},$$

where $I_n = I(\sum_{k=0}^{n} J_k > \kappa n)$ with $J_k = I(\min_{l=1,\ldots,d} |\varepsilon_{t-k-l}| \le K_2)$ and a constant $0 < \kappa < 1$ to be specified later on. Lemma A.1 is a direct consequence of the following two facts:

(i) There exists a constant $p > 0$ such that $E[B_{n,1}^p] \le C\gamma^n$ for some $\gamma < 1$.

(ii) $E[B_{n,2}] \le C\gamma^n$ for some $\gamma < 1$.

We start with the proof of (i). Letting $\phi_n = \lambda^n$ with some positive constant $\lambda < 1$, we can write

$$E[B_{n,1}^p] = E[I(B_{n,1} > \phi_n) B_{n,1}^p] + E[I(B_{n,1} \le \phi_n) B_{n,1}^p] \le \big(E[B_{n,1}^{2p}]\, P(B_{n,1} > \phi_n)\big)^{1/2} + \phi_n^p.$$

It is easy to see that $E[B_{n,1}^{2p}] \le C_p^n$ for a sufficiently large constant $C_p$, where $C_p$ can be made arbitrarily close to one by choosing $p > 0$ small enough. To show (i), it thus suffices to verify that

$$P(B_{n,1} > \phi_n) \le C\gamma^n \quad \text{for some } \gamma < 1. \tag{29}$$

For the proof of (29), we write

$$P(B_{n,1} > \phi_n) \le P(I_n > 0) = P\Big(\sum_{k=0}^{n} (J_k - E[J_k]) > \kappa_0 n\Big)$$

with $\kappa_0 := \kappa - E[J_k]$. As the variables $\varepsilon_t$ have an everywhere positive density by assumption, the expectation $E[J_k]$ is strictly smaller than one. We can thus choose $0 < \kappa < 1$ slightly larger than $E[J_k]$ to get that $0 < \kappa_0 < 1$. As the variables $J_k - E[J_k]$ for $k = 0,\ldots,n$ are $2d$-dependent, a simple blocking argument together with Hoeffding's inequality shows that

$$P\Big(\sum_{k=0}^{n} (J_k - E[J_k]) > \kappa_0 n\Big) \le C\gamma^n$$

for some $\gamma < 1$. This yields (29) and thus completes the proof of (i).
Let us now turn to the proof of (ii). We have that

$$B_{n,2} = (1 - I_n) \prod_{k=0}^{n} (1 + |\varepsilon_{t-k}|)\, \Big\|\prod_{k=0}^{n} B(\Delta_{t-k})\Big\|_1.$$

The random matrix $B(\Delta_{t-k})$ in the above expression can only take two forms: if $\|\underline{\varepsilon}_{t-k-1}\|_\infty > K_2$, it equals $B(\delta)$, and if $\|\underline{\varepsilon}_{t-k-1}\|_\infty \le K_2$, it equals $B(\Delta)$. Moreover, if $\min_{l=1,\ldots,d} |\varepsilon_{t-k-l}| > K_2$, it holds that $\|\underline{\varepsilon}_{t-k-l}\|_\infty > K_2$ for all $l = 1,\ldots,d$ and thus $\prod_{l=0}^{d-1} B(\Delta_{t-k-l}) = B(\delta)^d$. Importantly, the term $B_{n,2}$ is unequal to zero only if $I_n = 0$, that is, only if $\min_{l=1,\ldots,d} |\varepsilon_{t-k-l}| > K_2$ for at least $(1 - \kappa)n$ terms. From this, we can infer that

$$E[B_{n,2}] \le E\Big[\prod_{k=0}^{n} (1 + |\varepsilon_{t-k}|)\Big]\, \|B(\Delta)\|_1^{\kappa n}\, \|B(\delta)^d\|_1^{(1-\kappa)n/d}. \tag{30}$$

By direct calculations, we can verify that $\|B(\delta)^d\|_1 \le C_d\,\delta$ with the constant C_d = Yli=o Si=o (fc) that only depends on the dimension $d$. Moreover, $\|B(\Delta)\|_1 \le (\Delta + 1)$. Plugging this into (30) yields

$$E[B_{n,2}] \le (1 + E|\varepsilon_0|)\big[(1 + E|\varepsilon_0|)(\Delta + 1)^{\kappa} (C_d\,\delta)^{(1-\kappa)/d}\big]^n.$$

Straightforward calculations show that the term in square brackets is strictly smaller than one for

$$\delta < \big[(1 + E|\varepsilon_0|)^{d/(1-\kappa)} (\Delta + 1)^{\kappa d/(1-\kappa)} C_d\big]^{-1}. \tag{31}$$

Assuming that $\delta$ satisfies the above condition, we thus arrive at (ii). □
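The two norm bounds used in the proof of (ii) are easy to check numerically. The sketch below assumes that $B(z)$ is the $d \times d$ companion-type matrix with first row $(z, \ldots, z)$ and ones on the subdiagonal, which is our reading of the display in the notation paragraph; the helper names `companion` and `norm1` are hypothetical. Since $B(0)$ is a nilpotent shift matrix, every entry of $B(\delta)^d$ is of order $\delta$.

```python
import numpy as np

def companion(z, d):
    """Assumed form of B(z): first row (z, ..., z), ones on the subdiagonal."""
    B = np.zeros((d, d))
    B[0, :] = z
    B[np.arange(1, d), np.arange(d - 1)] = 1.0
    return B

def norm1(A):
    """||A||_1 = max_k sum_j |a_jk| (maximum absolute column sum)."""
    return float(np.abs(A).sum(axis=0).max())

d = 3
for delta in (1e-1, 1e-2, 1e-3):
    Bd = np.linalg.matrix_power(companion(delta, d), d)
    ratio = norm1(Bd) / delta      # stays bounded as delta shrinks
    print(delta, ratio)
```

For $d = 2$ the calculation can be done by hand: $B(\delta)^2$ has first column $(\delta^2 + \delta, \delta)^T$, so $\|B(\delta)^2\|_1 = \delta^2 + 2\delta$, and $\|B(\Delta)\|_1 = \Delta + 1$ follows from the first column $(\Delta, 1, 0, \ldots)^T$.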

Proof of Theorem 3.3. The proof can be found in the supplement [22]. 

Proof of Theorem 3.4. To start with, note that the process $\{Y_{t,T}\}$ is $d$-Markovian. This implies that

$$\beta(k) = \sup\,\sup\, \beta(\sigma(\underline{Y}_{t-k,T}), \sigma(\underline{Y}_{t+d-1,T}))$$

with

$$\beta(\sigma(\underline{Y}_{t-k,T}), \sigma(\underline{Y}_{t+d-1,T})) = E\Big[\sup_{S \in \sigma(\underline{Y}_{t+d-1,T})} |P(S) - P(S\,|\,\sigma(\underline{Y}_{t-k,T}))|\Big].$$

In the following, we bound the expression $|P(S) - P(S\,|\,\sigma(\underline{Y}_{t-k,T}))|$ for arbitrary sets $S \in \sigma(\underline{Y}_{t+d-1,T})$. This provides us with a bound for the mixing coefficients $\beta(k)$ of the process $\{Y_{t,T}\}$.

We use the following notation: throughout the proof, we let y = y\ +d _i, 

e = e^ +1 and z = zlz k ~ d+1 be values of Y t+d _ 1T , e\z\ +l and Y-t-k.Ti re- 
spectively. Moreover, we use the shorthand 

/j(^jl*) = /y t+J - T |y?^^ 

for $j = 0,\ldots,d-1$, where we suppress the dependence on the arguments $y_{t+j-1}^{t}$ and $e$ in the notation. Finally, note that by (25), the above conditional density can be expressed in terms of the error density $f_\varepsilon$ as

$$f_j(y_{t+j}|z) = \frac{1}{\sigma_{t,T,j}(z)}\, f_\varepsilon\Big(\frac{y_{t+j} - m_{t,T,j}(z)}{\sigma_{t,T,j}(z)}\Big) \tag{32}$$

with



ft + j 



m t , Td (z)=m\^,yi +j _ 1 ,mlt~ 1 ^ 



(k-2),t-k+l 



(k-j+d-l) ( t-k+l \ , (k-j+d-l), t-k+l 
'"'J---^ V t+7— <f— 1' ) ' t+j—d,T \ c t+j- 



L t+j-d,T \°t+j- 

and a t ,T,j{z) defined analogously. The functions j., u^^, . . . were in- 
troduced in the preliminaries section of the Appendix. 
With this notation at hand, we can write 

nS\a(Y t _ k>T )) 

= E[E[I(Y t+d _ hT G S)\et\ + \Y t _ KT }\Y t _ KT ] 



k-l 



\ t—K+1 V 11 

■t+d-l,T\ b t-l i-<_t-fc,T 
1 fc-1 



I(y G 5) J] ./)!//,., V ? ,.,-) J] f e (et- t ) dedy 

j=0 1=1 



and likewise 



d-l 



k-l 



P(5)= / /(yeS)n/j(!fci-jk)nA( e w&. k , r W'fe'fa^ 
^ j=o z=i 

Using the shorthand Y_ = Y_ t _ k T , we thus arrive at 
\F(S)-F(S\a(Y))\ 



< 



d-l d-l 
3=0 j=0 



dy 



k-l 



~[fe{et-i)fY_(z) de dz. 



i=i 



=:(*) 



We next consider (*) more closely. A telescoping argument together with 
Fubini's theorem yields that 



d-i 

(*)<£ 

i=0 
d-l 



i-1 



lfj(ytH\i:)\fi(yt+i\z)-fi(yt+m\ ' ' f 3 {yt+ 3 \z) 



■3=0 



j=i+l 



dy 



d—i „ r „ r „ d—i 

/ / II fj(yt+j\ z ) d yt+d-i---dy t +i+i 

i=o J \J \J j=i+l 



x \fi{yt+i\z) - fi{yt+i\Y)\ dy t+i 
d-l
i=0 



t-1 

x II fjiVt+jilO dvt+i-i ■■■dy t 

3=0 



\fi(yt+i\z) - fi(yt+i\Y)\ dy t+ i 



-i i-1 



n fj{yt+j\Y_) dy t+i -i ■ ■ ■ dy t , 

■3=0 



:(**) 



where the last inequality exploits the fact that 

d-l 

-l 



. d-l 

/ II fj(yt+j\ z ) d yt+d- 



dy 



t+i+l 



3=i+l 



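is a conditional probability and thus almost surely bounded by one. Using formula (32) together with (E3), it is straightforward to see that

$$(**) = \int \Big| \frac{1}{\sigma_{t,T,i}(z)}\, f_\varepsilon\Big(\frac{y_{t+i} - m_{t,T,i}(z)}{\sigma_{t,T,i}(z)}\Big) - \frac{1}{\sigma_{t,T,i}(\underline{Y})}\, f_\varepsilon\Big(\frac{y_{t+i} - m_{t,T,i}(\underline{Y})}{\sigma_{t,T,i}(\underline{Y})}\Big) \Big|\,dy_{t+i}$$

$$\le C\big(|m_{t,T,i}(z) - m_{t,T,i}(\underline{Y})| + |\sigma_{t,T,i}(z) - \sigma_{t,T,i}(\underline{Y})|\big)$$

$$\le C(2C_m + 2C_\sigma)\big(|m_{t,T,i}(z) - m_{t,T,i}(\underline{Y})| + |\sigma_{t,T,i}(z) - \sigma_{t,T,i}(\underline{Y})|\big)^p,$$

where $p$ is some constant with $0 < p < 1$. Iterating backward $n \le k - 2d$ times in the same way as in Theorem 3.2, we can further show that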



(33) 



\m t ,T,i(z) - m t ,T,i(Y) \ + \at,T,i( z ) ~ °t,T,iQQ\ 



d—i 



n 



m=Q 



'l + \\e 



t—j—n—d I 
t-j-n-1 1 



where $\|\cdot\|$ denotes the Euclidean norm for vectors and the spectral norm for matrices. The matrix $B_t$ was introduced in (27). Note that $B_t$ was defined there in terms of the random vector $\varepsilon_t^{t-d}$. Slightly abusing notation, we here use the symbol $B_t$ to denote the matrix with $\varepsilon_t^{t-d}$ replaced by the realization $e_t^{t-d}$. Keeping in mind that the matrix $B_t$ only depends on the residual values $e_t^{t-d}$, we can plug (33) into the bound for (**) and insert this into the bound for (*) to arrive at




II B t-j-m 
m=Q 



(i + IK 



t—j—n—d I 
j-n-1 I 
As a consequence,



»(S)-F(S\a(Y))\<CE [J2 

\j=i 



II B -r 



m=0 



Using the arguments from Lemma A.1, we can show that for $p > 0$ sufficiently small, the expectation on the right-hand side is bounded by $C\lambda^n$ for some positive constant $\lambda < 1$. Choosing $n = k - 2d$, for instance, we thus arrive at

\F(S) - P(S\a(Y t _ ktT ))\ < CAM d+1 ) < Cl k 
for some constant $\gamma < 1$. This immediately implies that $\beta(k) \le C\gamma^k$.

APPENDIX B 

In this Appendix, we prove the results of Section 4. Before we turn to the 
proofs, we state two auxiliary lemmas which are repeatedly used throughout 
the Appendix. The proofs are straightforward and thus omitted. 



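Lemma B.1. Suppose the kernel $K$ satisfies (C6) and let $I_h = [C_1 h, 1 - C_1 h]$. Then for $k = 0,1,2$,

$$\sup_{u \in I_h} \Big| \frac{1}{Th} \sum_{t=1}^{T} K_h\Big(u - \frac{t}{T}\Big) \Big(\frac{u - t/T}{h}\Big)^k - \frac{1}{h} \int_0^1 K_h(u - \varphi) \Big(\frac{u - \varphi}{h}\Big)^k d\varphi \Big| = O\Big(\frac{1}{Th^2}\Big).$$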

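Lemma B.2. Suppose $K$ satisfies (C6) and let $g: [0,1] \times \mathbb{R}^d \to \mathbb{R}$, $(u,x) \mapsto g(u,x)$ be continuously differentiable w.r.t. $u$. Then for any compact set $S \subset \mathbb{R}^d$,

$$\sup_{u \in I_h,\, x \in S} \Big| \frac{1}{Th} \sum_{t=1}^{T} K_h\Big(u - \frac{t}{T}\Big)\, g\Big(\frac{t}{T}, x\Big) - g(u,x) \Big| = O\Big(\frac{1}{Th^2}\Big) + o(h).$$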



Proof of Theorem 4.1. To show the result, we use a blocking argument 
together with an exponential inequality for mixing arrays, thus following the 
common proving strategy to be found, for example, in Bosq [3], Masry [18] 
or Hansen [13]. In particular, we go along the lines of Hansen's proof of 
Theorem 2 in [13], modifying his arguments to allow for local stationar- 
ity in the data. A detailed version of the arguments can be found in the 
supplement [22]. 



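Proof of Theorem 4.2. We write

$$\hat m(u,x) - m(u,x) = \frac{1}{\hat f(u,x)} \big(\hat g^V(u,x) + \hat g^B(u,x) - m(u,x)\hat f(u,x)\big)$$

with

$$\hat f(u,x) = \frac{1}{Th^{d+1}} \sum_{t=1}^{T} K_h\Big(u - \frac{t}{T}\Big) \prod_{j=1}^{d} K_h(x^j - X_{t,T}^j),$$

$$\hat g^V(u,x) = \frac{1}{Th^{d+1}} \sum_{t=1}^{T} K_h\Big(u - \frac{t}{T}\Big) \prod_{j=1}^{d} K_h(x^j - X_{t,T}^j)\, \varepsilon_{t,T},$$

$$\hat g^B(u,x) = \frac{1}{Th^{d+1}} \sum_{t=1}^{T} K_h\Big(u - \frac{t}{T}\Big) \prod_{j=1}^{d} K_h(x^j - X_{t,T}^j)\, m\Big(\frac{t}{T}, X_{t,T}\Big).$$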


We first derive some intermediate results for the above expressions:

(i) By Theorem 4.1 with $W_{t,T} = \varepsilon_{t,T}$,

$$\sup_{u \in [0,1],\, x \in S} |\hat g^V(u,x)| = O_p\Big(\sqrt{\frac{\log T}{Th^{d+1}}}\Big).$$

(ii) Applying the arguments for Theorem 4.1 to $\hat g^B(u,x) - m(u,x)\hat f(u,x)$ yields

$$\sup_{u \in [0,1],\, x \in S} \big|\hat g^B(u,x) - m(u,x)\hat f(u,x) - E[\hat g^B(u,x) - m(u,x)\hat f(u,x)]\big| = O_p\Big(\sqrt{\frac{\log T}{Th^{d+1}}}\Big).$$

(iii) It holds that



sup \E\g (u,x) — m(u,x)f(u,x)]\ 

u£lh,x£S 

d 

= h 2 Y^2( 2 dim(u,x)dif(u,x) + dlm(u,x)f(u,x)) 
i=0 

with $r = \min\{\rho, 1\}$. The proof is postponed until the arguments for Theorem 4.2 are completed.

(iv) We have that

$$\sup |\hat f(u,x) - f(u,x)| = o_p(1).$$

For the proof, we split up the term $\hat f(u,x) - f(u,x)$ into a variance part $\hat f(u,x) - E\hat f(u,x)$ and a bias part $E\hat f(u,x) - f(u,x)$. Applying Theorem 4.1 with $W_{t,T} = 1$ yields that the variance part is $o_p(1)$ uniformly in $u$. The bias part can be analyzed by a simplified version of the arguments used to prove (iii).

Combining the intermediate results (i)-(iii), we arrive at

$$\sup |\hat m(u,x) - m(u,x)| \le \big(\sup \hat f(u,x)^{-1}\big)\big(\sup |\hat g^V(u,x)| + \sup |\hat g^B(u,x) - m(u,x)\hat f(u,x)|\big)$$

with $r = \min\{\rho, 1\}$. Moreover, (iv) and the condition that $\inf_{u \in [0,1],\, x \in S} f(u,x) > 0$ immediately imply that $\sup \hat f(u,x)^{-1} = O_p(1)$. This completes the proof.
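For readers who want to experiment with the estimator analysed in this proof, the ratio $\hat m = (\hat g^V + \hat g^B)/\hat f$ can be computed directly as a weighted average of the responses. The sketch below is a minimal illustration under our own assumptions: an Epanechnikov product kernel, a common bandwidth $h$ for time and covariates, and the hypothetical function name `nw_estimate`. The factor $1/(Th^{d+1})$ cancels in the ratio and is therefore omitted.

```python
import numpy as np

def epan(z):
    """Epanechnikov kernel on [-1, 1] (assumed choice of K)."""
    return np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z**2), 0.0)

def nw_estimate(u, x, X, Y, h):
    """Nadaraya-Watson estimate of m(u, x) from (Y_t, X_t), t = 1, ..., T.
    X has shape (T, d), x has shape (d,), u is a rescaled time in [0, 1]."""
    T = X.shape[0]
    times = np.arange(1, T + 1) / T
    # product kernel: one factor for rescaled time, one per covariate
    w = epan((u - times) / h) * np.prod(epan((x - X) / h), axis=1)
    return float(np.sum(w * Y) / np.sum(w))
```

As a usage example, data generated from the hypothetical model $Y_t = (1 + t/T)X_t + 0.1\,\eta_t$ with uniform covariates give an estimate close to $m(0.5, 0.5) = 0.75$ when evaluated at $u = 0.5$, $x = 0.5$.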

Proof of (iii). Let $\bar K: \mathbb{R} \to \mathbb{R}$ be a Lipschitz continuous function with support $[-qC_1, qC_1]$ for some $q > 1$. Assume that $\bar K(x) = 1$ for all $x \in [-C_1, C_1]$ and write $\bar K_h(x) = \bar K(\frac{x}{h})$. Then

$$E[\hat g^B(u,x) - m(u,x)\hat f(u,x)] = Q_1(u,x) + \cdots + Q_4(u,x)$$

with 



Qi(u,x) 



t=i v y 



and 

<7i(it, x) = E 



n^( 



t,T) 



x <jm( -,Jf t , T 



m(ti,x)| 



g 2 (u,x) = E 



n^( 



4t)U k > 



x <J m{ —,Xt y T 



m 
q 3 (u,x) = E



n K h (x= - xy - n Sk (*• - x>(p) ) 



m[ —,X t 



qi(u, x) = E 



— m(u, x) 
m(u, x) 



Li=l 

We first consider Qi(u, x). As the kernel K is bounded, we can use a telescop- 
ing argument to get that | Y[ d j=l K h {x? - X\ T ) - l\ d j=1 K h { X i - X\ (|))| < 
CYX=\ \ K h{x k -X^ T )-K h {x k -X^{^))\. Once again exploiting the bound- 
edness of K, we can find a constant C < oo with l-fT^a^' — X* T ) — Ky l {x k — 
**(?f))l < C\K h (x k - X k T ) - K h {x k - X^{^))\ r for r = min{p, 1}. Hence, 



(34) 

a 
k=l 

Using (34), we obtain 
\Qi(u,x)\ 

T 



n K h (ap - xy) - n K h u - xi (i) ) 

.7=1 .7=1 v v // 



K h (x k -X k )-K h (x k -X k 



< 



c 



d 

E 



,fc=i 



K h (x fc -X t fc T )-K h (x fc -X t fc (^ 



x[]^^-4 T ) 

i=i 



m 



T 



X, 



t.T 



m(u, x) 



with r = min{p, 1}. The term Y\.j=\ Xh{x° — X 3 tT )\m{-^, X^t) — m(u, x)\ in 
the above expression can be bounded by Ch. Since K is Lipschitz, \X k T — 

X k {-!f)\ < y^i,T(^) and the variables Ut^r{^) have finite rth moment, we 
can infer that 

\Qi(u,x)\ 



Th d 



t=i 



u-- IE 



E 

,fc=i 



K h (x K -X? T )-K h [x K -X? 



T. 



c T 

< -J 

~ Th d 
t=i v j



£ 

k=l 



Th Ut > T \h 



C 

< 



rpr fod—l+r 



uniformly in u and x. Using similar arguments, we can further show that 
sup u it, £c) | < an( i s ^Pu,x\Qs{ u t x )\ — jv/^-i+r ■ Finally, applying 
Lemmas B.l and B.2 and exploiting the smoothness conditions on m and /, 
we obtain that uniformly in u and x, 



Qi(u, x) = h 2 — s ^{2dim(u, x) d{f(u, x) + d^m^, x)f(u, x)) + o(h 2 

i=0 

Combining the results on $Q_1(u,x), \ldots, Q_4(u,x)$ yields (iii). □



Proof of Theorem 4.3. The result can be shown by using the techniques 
from Theorem 4.2 together with a blocking argument. More details are given 
in the supplement [22]. 

APPENDIX C 

In this Appendix, we prove the results concerning the smooth backfitting estimates of Section 5. Throughout the Appendix, conditions (Add1) and (Add2) are assumed to be satisfied.

Auxiliary results. Before we come to the proof of Theorems 5.1 and 5.2, 
we provide results on uniform convergence rates for the kernel smoothers 
that are used as pilot estimates in the smooth backfitting procedure. We 
start with an auxiliary lemma which is needed to derive the various rates. 

Lemma C.1. Define $T_0 = E[T_{[0,1]^d}]$. Then uniformly for $u \in I_h$,

$$\frac{T_0}{T} = P(X_0(u) \in [0,1]^d) + O(T^{-\rho/(1+\rho)}) + o(h) \tag{35}$$

with $\rho$ defined in assumption (C1) and

$$\frac{T_{[0,1]^d} - T_0}{T_0} = O_p\Big(\sqrt{\frac{\log T}{Th}}\Big). \tag{36}$$

Proof. The proof can be found in the supplement [22]. □

We now examine the convergence behavior of the pilot estimates of the backfitting procedure. We first consider the density estimates $\hat p_j$ and $\hat p_{j,k}$.

Lemma C.2. Define $v_{T,2} = \sqrt{\log T / Th^2}$, $v_{T,3} = \sqrt{\log T / Th^3}$ and $b_{T,r} = T^{-r} h^{-(d+r)}$ with $r = \min\{\rho, 1\}$. Moreover, let $\kappa_0(w) = \int K_h(w,v)\,dv$. Then

$$\sup_{u,\,x^j \in I_h} |\hat p_j(u,x^j) - p_j(u,x^j)| = O_p(v_{T,2}) + O(b_{T,r}) + o(h),$$

$$\sup_{u \in I_h,\, x^j \in [0,1]} |\hat p_j(u,x^j) - \kappa_0(x^j)\, p_j(u,x^j)| = O_p(v_{T,2}) + O(b_{T,r}) + O(h),$$

$$\sup_{u,\,x^j,\,x^k \in I_h} |\hat p_{j,k}(u,x^j,x^k) - p_{j,k}(u,x^j,x^k)| = O_p(v_{T,3}) + O(b_{T,r}) + o(h),$$

$$\sup_{u \in I_h,\, x^j, x^k \in [0,1]} |\hat p_{j,k}(u,x^j,x^k) - \kappa_0(x^j)\kappa_0(x^k)\, p_{j,k}(u,x^j,x^k)| = O_p(v_{T,3}) + O(b_{T,r}) + O(h).$$

Proof. We only consider the term pj, the proof for pj jk being analogous. 
Defining Pj{u^) = (To)" 1 Y?U J ( X *T g %l] d )K h {u^)K h {x 3 ,X 3 t;r ) with 
To = E[Tj 0)1 ]d], we obtain that 

-i 



Pj(u, x 3 



1 + 



r [o,i] d - 


T 


To 




r [o,i] d - 


T 


To 



+ 



Pj(u,x 3 ) 

T [o,i} d - r o x 



To 



By (36) from Lemma C.l, this implies that 

Pj(u,x j ) =pj(u,x j ) + O p (y/\og T/Th) 

uniformly for u G T, and x 3 G [0,1]. Applying the proving strategy of Theo- 
rem 4.2 to pj(u,x 3 ) completes the proof. □ 

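We next examine the Nadaraya-Watson smoother $\hat m_j$. To this purpose, we decompose it into a variance part $\hat m_j^A$ and a bias part $\hat m_j^B$. The decomposition is given by $\hat m_j(u,x^j) = \hat m_j^A(u,x^j) + \hat m_j^B(u,x^j)$ with

$$\hat m_j^A(u,x^j) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0,1]^d)\, K_h\Big(u,\frac{t}{T}\Big) K_h(x^j, X_{t,T}^j)\, \varepsilon_{t,T} \Big/ \hat p_j(u,x^j),$$

$$\hat m_j^B(u,x^j) = \frac{1}{T_{[0,1]^d}} \sum_{t=1}^{T} I(X_{t,T} \in [0,1]^d)\, K_h\Big(u,\frac{t}{T}\Big) K_h(x^j, X_{t,T}^j) \Big(m_0\Big(\frac{t}{T}\Big) + \sum_{k=1}^{d} m_k\Big(\frac{t}{T}, X_{t,T}^k\Big)\Big) \Big/ \hat p_j(u,x^j).$$

The next two lemmas characterize the asymptotic behavior of $\hat m_j^A$ and $\hat m_j^B$.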






Lemma C.3. It holds that

$$\sup_{u,\,x^j \in [0,1]} |\hat m_j^A(u,x^j)| = O_p\Big(\sqrt{\frac{\log T}{Th^2}}\Big). \tag{37}$$

Proof. Replacing the occurrences of $T_{[0,1]^d}$ in $\hat m_j^A$ by $T_0 = E[T_{[0,1]^d}]$ and then applying Theorem 4.1 gives the result. □



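Lemma C.4. It holds that

$$\sup_{u,\,x^j \in I_h} |\hat m_j^B(u,x^j) - \hat\mu_{T,j}(u,x^j)| = o_p(h^2), \tag{38}$$

$$\sup_{u \in I_h,\, x^j \in I_h^c} |\hat m_j^B(u,x^j) - \hat\mu_{T,j}(u,x^j)| = O_p(h^2) \tag{39}$$

with $I_h^c = [0,1] \setminus I_h$ and

$$\hat\mu_{T,j}(u,x^j) = \alpha_{T,0}(u) + \alpha_{T,j}(u,x^j) + \sum_{k \ne j} \int \alpha_{T,k}(u,x^k)\, \frac{\hat p_{j,k}(u,x^j,x^k)}{\hat p_j(u,x^j)}\,dx^k + h^2 \int \beta(u,x)\, \frac{p(u,x)}{p_j(u,x^j)}\,dx^{-j}.$$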



Here, 



h 2 



a T ,o(u) = m (u) + hn\(u) d u m (u) + —k 2 (u) d uu m (u), 



a T>k (u, x k ) = m k (u, x k ) + h 



Ki(u)d u m k (u,x 



K (u)Kl(x k ) 
K (x k ) 



d x km k (u, x k ) 



(3(u,x) = K 2 d u m (u) d u log p(u,x) 



+ 23] K 2d u 
k=i 



m k (u,x k ) d u logp(u,x) + — dl u m k (u,x k 



+ K 2 d xk m k (u,x k )d x k logp(u,x) + ^ d 2 xkxk m k {u,x k \ 

where the symbol $\partial_z g$ denotes the partial derivative of the function $g$ with respect to $z$ and $\kappa_2 = \int w^2 K(w)\,dw$ as well as $\kappa_l(v) = \int w^l K_h(v,w)\,dw$ for $l = 0,1,2$.



Proof. As the proof is rather lengthy and involved, we only sketch its idea. A detailed version can be found in the supplement [22]. To provide the stochastic expansion of $\hat m_j^B(u,x^j)$ in (38) and (39), we follow the proving strategy of Theorem 4 in Mammen et al. [17]. Adapting this strategy is, however, not completely straightforward. The complication mainly results from the fact that we cannot work with the variables $X_{t,T}$ directly but have to replace them by the approximations $X_t(\frac{t}{T})$. To cope with the resulting difficulties, we exploit (35) and (36) of Lemma C.1 and use arguments similar to those for Theorem 4.2. □



We finally state a result on the convergence behavior of the term $\tilde m_0(u)$.

Lemma C.5. It holds that

$$\sup_{u \in I_h} |\tilde m_0(u) - m_0(u)| = O_p\Big(\sqrt{\frac{\log T}{Th}} + h^2\Big). \tag{40}$$

Proof. The claim can be shown by replacing $T_{[0,1]^d}$ with $T_0 = E[T_{[0,1]^d}]$ in the expression for $\tilde m_0(u)$ and then using arguments from Theorem 4.2. □

Proof of Theorems 5.1 and 5.2. Using the auxiliary results from the previous subsection, it is not difficult to show that the high-level conditions (A1)-(A6), (A8) and (A9) of Mammen et al. [17] are satisfied. We can thus apply their Theorems 1-3, which imply the statements of Theorems 5.1 and 5.2. Note that the high-level conditions are satisfied uniformly for $u \in I_h$ rather than only pointwise. Inspecting the proofs of Theorems 1-3 in [17], this allows us to infer that the convergence rates in (21) hold uniformly over $u \in I_h$ rather than only pointwise. A list of the high-level conditions together with the details of the proof can be found in the supplement [22].
SUPPLEMENTARY MATERIAL

Additional technical details (DOI: 10.1214/12-AOS1043SUPP; .pdf). The proofs and technical details that are omitted in the Appendices are provided in the supplement that accompanies the paper.